I wonder how much de-duping the really old history would help. It seems
that HISTCONTROL='erasedups' only affects the history of the current
bash process (i.e. commands that were typed since you started that
shell), and it leaves all the stuff it loaded from .bash_history alone.
As a quick test, removing duplicates from a 4 MB history file reduced
the number of commands in it from 125236 to 36937, so about 70% of its
entries were duplicates (less than that by size, 'cause the longer and
more interesting commands mostly stayed...). Doing that to your 11 MB
file might get rid of that loading delay.
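
For reference, a minimal sketch of that kind of de-duplication, in Python
rather than shell. It keeps only the most recent copy of each command, so
the survivors stay in the order they were last used. The function names
are made up for illustration, and it assumes a history file with one
command per line and no "#<epoch>" timestamp lines (i.e. HISTTIMEFORMAT
was not set when the file was written).

```python
def dedupe_keep_last(lines):
    """Drop repeated commands, keeping only the most recent copy of each."""
    seen = set()
    kept = []
    # Walk from newest to oldest so the first copy we meet is the latest one.
    for line in reversed(lines):
        if line not in seen:
            seen.add(line)
            kept.append(line)
    kept.reverse()  # restore oldest-to-newest order
    return kept


def dedupe_file(src, dst):
    """Write a de-duplicated copy of src to dst; return (before, after) counts."""
    # surrogateescape lets odd bytes in old history lines round-trip unchanged.
    with open(src, encoding="utf-8", errors="surrogateescape") as f:
        lines = [line.rstrip("\n") for line in f]
    kept = dedupe_keep_last(lines)
    with open(dst, "w", encoding="utf-8", errors="surrogateescape") as f:
        f.writelines(line + "\n" for line in kept)
    return len(lines), len(kept)
```

Writing to a separate file rather than over .bash_history leaves the
original around in case the result is not what you wanted.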
Of course, de-duplicating the history destroys its role of "accurately
record everything I've done", so if you also use your history for that
it's not a good idea. For that latter use though, I can't think of a
good reason for loading it on shell start, so maybe those roles should
be split -- .bash_log and .bash_commands? The log is write-only, never
clobbered, and has the equivalent of a HISTTIMEFORMAT set; the commands
file is an efficiently stored hash table of unique commands, maybe with
tweakable parameters for how "interesting" a command has to be to go in
it (store "mount -o loop,ro,uid=1000 -t vfat /some/file /mnt/temp" but
ignore "cd ~" 'cause you really don't need Ctrl+R to remember the latter).
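
A rough sketch of what that split could look like, in Python just to show
the shape of it. Everything here is hypothetical: the class name, the log
format (epoch seconds, a tab, then the command), and especially the
"interesting" test, which is only a length threshold standing in for
whatever tweakable rule would really be wanted. It also assumes
single-line commands.

```python
import time


def is_interesting(command, min_length=12):
    """Hypothetical filter: only commands of at least min_length characters."""
    return len(command.strip()) >= min_length


class SplitHistory:
    """Append-only timestamped log plus an in-memory table of unique commands."""

    def __init__(self, log_path, min_length=12):
        self.log_path = log_path
        self.min_length = min_length
        self.commands = {}  # command -> time it was last used (dict = hash table)

    def record(self, command, now=None):
        now = time.time() if now is None else now
        # The log is opened in append mode, so earlier entries are never clobbered.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(f"{int(now)}\t{command}\n")
        # Only "interesting" commands go into the searchable table.
        if is_interesting(command, self.min_length):
            self.commands[command] = now

    def search(self, fragment):
        """Return unique commands containing fragment, most recently used first."""
        hits = [c for c in self.commands if fragment in c]
        return sorted(hits, key=self.commands.get, reverse=True)
```

With the default threshold, the mount example above gets stored and
"cd ~" only ever lands in the log.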
~Felix.
On 16/06/11 12:55, Bradley M. Kuhn wrote:
I agree with Marcel's points about keeping a big bash history, although
I wasn't sure if discussing "why" users keep a big bash history was on
topic or not.
Marcel (Felix) Giannelia wrote at 13:16 (EDT) on Tuesday:
A .bash_history file going back years and years is still only a few
megs,
Actually, this relates to a thing I'd been looking into recently. My
bash history is 11MB now, and on some machines I have a noticeable load
time as it reads the history. I'd thought about adding support for
incremental read to bash history/readline code. Basically, it would
load only the parts of the history needed to satisfy a given request.
Obviously running "history" would read it all, but if
reverse-search was requested, it could perhaps be read incrementally
somehow.
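
To make the idea concrete, here is a small Python sketch (not readline
code, and not a proposed patch) of reading a history file backward in
fixed-size chunks, so a reverse search only touches as much of the file
as it needs to find a match. The function names and the 64 KB chunk size
are arbitrary choices for illustration, and it assumes one command per
line.

```python
import os


def lines_from_end(path, chunk_size=64 * 1024):
    """Yield non-empty lines of a file from last to first, one chunk at a time."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""  # partial line carried over from the chunk read previously
        while pos > 0:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            parts = (f.read(step) + tail).split(b"\n")
            tail = parts[0]  # may be incomplete: its start is in an earlier chunk
            for line in reversed(parts[1:]):
                if line:
                    yield line
        if tail:
            yield tail


def reverse_search(path, fragment, chunk_size=64 * 1024):
    """Return the most recent line containing fragment, or None if absent."""
    needle = fragment.encode()
    for line in lines_from_end(path, chunk_size):
        if needle in line:
            return line.decode(errors="replace")
    return None
```

A search that matches a recent command stops after the first chunk or
two; only a search with no match ends up reading the whole file.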
Given that this would be a big change (esp. to make it seamless to
existing readline API users), and would provide a feature that clearly
isn't universally desired (the ability to have really big history files),
I'm asking, albeit with some trepidation, if such a rewrite of the
history reading/writing code would likely be accepted, and if so what it
would need to look like to be an acceptable patch.
I noticed someone previously attempted to implement mmap() in the
history code, but it's #ifdef'd out (IIRC from my investigations a few
weeks ago). I theorized that it was #ifdef'd out because implementing
mmap() didn't help anything: the history reading code immediately goes
through the whole array of history, so the entire file ends up read
into RAM the way the code currently operates, even if you mmap() it.
In other words, just slapping mmap() in place wouldn't work (in fact,
it seems to have been tried and abandoned); more in-depth changes would
be needed.
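
As an illustration of the difference (in Python, not the history
library's C, and only a sketch under my own assumptions): mmap() can pay
off when the consumer searches the mapping in place instead of walking
all of it up front. The function name is made up, and it assumes one
command per line. I would expect a backward search like this to touch
only the pages between the end of the file and the match, but I have not
measured that.

```python
import mmap


def mmap_reverse_search(path, fragment):
    """Return the most recent line containing fragment, or None if absent."""
    needle = fragment.encode()
    with open(path, "rb") as f:
        try:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        except ValueError:  # an empty file cannot be mapped
            return None
        with mm:
            hit = mm.rfind(needle)  # searches backward from the end of the file
            if hit == -1:
                return None
            # Expand the match to the full line that contains it.
            start = mm.rfind(b"\n", 0, hit) + 1  # -1 + 1 == 0 when on the first line
            end = mm.find(b"\n", hit)
            if end == -1:
                end = len(mm)
            return mm[start:end].decode(errors="replace")
```

Nothing here builds an in-memory list of every entry, which is the step
that makes a plain mmap() swap pointless in the current code.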
Thoughts on this idea?