Slow history load with some values of HISTSIZE

2024-02-03 Thread Casey Johnson
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS: -g -O2 -flto=auto -ffat-lto-objects -flto=auto 
-ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security 
-Wall
uname output: Linux maxwell 5.19.0-50-generic #50-Ubuntu SMP PREEMPT_DYNAMIC 
Mon Jul 10 18:24:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Machine Type: x86_64-pc-linux-gnu

Bash Version: 5.1
Patch Level: 16
Release Status: release

Description:

  The current implementation of the command history takes a long time to 
load
  a large HISTFILE when history is stifled.  This is because there is a
  memmove() for every line of HISTFILE once HISTSIZE lines have been loaded.
  If N is the number of lines in HISTFILE then the cost of the memmoves ends
  up being something like (N - HISTSIZE) * HISTSIZE, peaking when HISTSIZE 
is
  roughly N/2.

In my case, the history file is over 300,000 lines.  If the history is
unstifled, it takes about 0.1 seconds to load.  With HISTSIZE around 200,000
it takes over 10 seconds to load, i.e., roughly 100 times as long.

Repeat-By:

Create a text file 'alt-history.txt' with a few hundred thousand lines, say
300k.  The content doesn't matter much.  For example, you can just cat
several copies of your actual history file to make one large file.

In a clean shell, execute:
HISTFILE=alt-history.txt
HISTSIZE=15
history -r
and then observe how long the last command runs before returning.

For comparison, in another clean shell, execute the same commands, with
unstifled history:
HISTFILE=alt-history.txt
HISTSIZE=-1
history -r
and then observe how long this takes.

If you repeat with larger and larger history files, you will observe the
runtime's quadratic growth.

Fix:

My proposed fix consists of a couple of components, which are included in
the patched history.c that is attached below:

 1. Currently, history reallocation is amortized by requesting 50
(DEFAULT_HISTORY_GROW_SIZE) extra slots at once.  We can do something
similar with a stifled history: allocate extra slots when HISTSIZE has
been reached.  Each time a new line is added, simply free the first
(oldest) entry and increment 'the_history' so the last entry is now
available.  Slide the window by one each time and only do a memmove()
when the extra slots are all gone.  Even allocating only 50 extra slots
reduces the memmove() work by a factor of 50, which brings the time down
to friendly territory.

 2. Vary the number of extra slots allocated as the history gets longer.  
The
current value of 50 is adequate when the history is small, but for a 
long
history it would be good to increase it so the work (both memmove() and
realloc()) amortizes better.  For example, a value of 1024 would only
preallocate 8KB on a 64-bit machine and 4KB on a 32-bit machine.  If
memory is a concern for some, this value could be made a configure-time
constant.  A better, more universal, approach would be to dynamically
choose the number of extra slots based on the length (N) of the history.
For example, if the extra slots are roughly equal to sqrt(N), with a
minimum of 50, the memmove() work becomes roughly N*log(N) instead of
N^2.  This works well for small values, but for absurdly large ones, 
too.

I include below a patched history.c that implements both of these.  The
array management is all handled by new functions I called advance_history()
and hist_shift_resize().  Each time a reallocation happens, it grows by
roughly 2*sqrt(N).

The tarball also contains a second version of the file, which includes some
error checking code and a realloc() if HISTSIZE is substantially reduced.
Those pieces may be worth consideration, but I feel less strongly about
them.  I based these commits on git commit
f3b6bd19457e260b65d11f2712ec3da56cef463f from Github.

In making these patches, I have endeavored to maintain the existing style
and functionality, touching as few functions as possible.  Thank you for
your consideration and I welcome any feedback you may have.

Casey Johnson


-- BEGIN base64-encoded bash-history.tar.gz --
H4sIA+08a3fbNpb9Gv0KND3TSLb8kPPoNI7d4yRyol3XzrGdJtlORkOLkMSEIlWS8qNt
9rfvfQAgAFJy0uk+5ix15kxqEri49+K+L4g4utjKZBDGUSK3plFepNnN5uirP/W3Db9HDx7gv73v
Hm7b/8Lvwf2H2ztf9R48fPjdg++2Hzy4/9V27+FO7/5XYvvPRaP+t8iLIBPiqw/pNMnTZOm4297/
i/621oTZdbGxIYAbSRjEaSL1cxFHF1kA/65ttVow/Fk6v8miybQQ7Wcd0fv+r99v7Gzv9MRhJqU4
S8fFVZBJcZguAE4RpUlXDJLRZqslhDgHkGIcxVKM0qQIoiQXxVSKF8evxUu12JFarK0edLoiELks
RDpGCFm6KEBSAUqaiVmQBJMomRCQQl7jIDHP5GWULvL4Rh

Re: Slow history load with some values of HISTSIZE

2024-02-05 Thread Casey Johnson
That is more or less how I happened upon this behavior. I have been using bash 
for 20+ years and for most of that time I have had HISTSIZE and HISTFILESIZE 
large enough to be effectively unlimited. I recently took the time to learn the 
difference between HISTSIZE and HISTFILESIZE. I liked the idea of having a 
perpetual history (unlimited HISTFILESIZE) but only keeping the most recent 
entries in memory by default, since those are the most likely ones I would want 
to reuse. If I really want to mine the old entries on occasion, I can always 
bump up HISTSIZE and load the whole file. I think I started with something like 
HISTSIZE=25000 and was surprised by how long new terminals took to produce the 
first prompt. I started playing around with different values and it wasn't hard 
to guess that there was a whole lot of memmove() (or similar) happening.

From: alex xmb sw ratchev 
Sent: Monday, February 5, 2024 7:29 PM
To: Dale R. Worley 
Cc: Casey Johnson ; bug-bash 
Subject: Re: Slow history load with some values of HISTSIZE



On Mon, Feb 5, 2024, 18:09 Dale R. Worley 
mailto:wor...@alum.mit.edu>> wrote:
Casey Johnson mailto:stry...@hotmail.com>> writes:
> In a clean shell, execute:
> HISTFILE=alt-history.txt
> HISTSIZE=15
> history -r
> and then observe how long the last command runs before returning.

Though I expect that when you exit bash, the history file gets trimmed
to 150,000 lines, and then the "over 10 seconds to load" doesn't happen
again.

there is another HISTFILESIZE var
maybe ..

Dale



Re: Slow history load with some values of HISTSIZE

2024-02-10 Thread Casey Johnson
Thanks! I'm happy to help.

From: alex xmb sw ratchev 
Sent: Friday, February 9, 2024 9:13 PM
To: Chester Ramey 
Cc: Casey Johnson ; bug-bash 
Subject: Re: Slow history load with some values of HISTSIZE



On Sat, Feb 10, 2024, 03:06 Chet Ramey 
mailto:chet.ra...@case.edu>> wrote:
On 2/3/24 3:29 PM, Casey Johnson wrote:

> Bash Version: 5.1
> Patch Level: 16
> Release Status: release
>
> Description:
>
> The current implementation of the command history takes a long time to load
> a large HISTFILE when history is stifled.  This is because there is a
> memmove() for every line of HISTFILE once HISTSIZE lines have been loaded.
> If N is the number of lines in HISTFILE then the cost of the memmoves ends
> up being something like (N - HISTSIZE) * HISTSIZE, peaking when HISTSIZE is
> roughly N/2.

Thanks for the report and the patch. I applied it with some tweaks and
saw a dramatic performance improvement for very large history files (I
tested with one that was about 69 entries) where HISTSIZE is
significantly smaller (I tested with HISTSIZE=10): from around 11
seconds to less than 0.3 seconds.

++ speed optimization

greets ++

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRUc...@case.edu<mailto:c...@case.edu>
http://tiswww.cwru.edu/~chet/