speeding up hash_search?

2020-04-19 Thread George Jones
It looks like hash_search just does a linear walk if array entries to find
elements in a list.   This slows down (order N?) new inserts when the
number of entries gets large.

Would there be any interest in merging a patch to add an option for making
this faster (maybe using b-trees?)

My analysis here https://eludom.github.io/blog/20200418/


Thanks,
---george jones


Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

2020-04-19 Thread George Jones
Thank you.   Patch applied and (performance) tested with come tests I was
working on
https://github.com/eludom/snippits/tree/master/bash/tests  bottom line:

Before:
...

lines_per_sec 993.37 LINES 270 ELAPSED 2718
lines_per_sec 955.95 LINES 280 ELAPSED 2929
lines_per_sec 921.51 LINES 290 ELAPSED 3147
lines_per_sec 889.41 LINES 300 ELAPSED 3373
lines_per_sec 859.43 LINES 310 ELAPSED 3607


After:

...

lines_per_sec 5.00 LINES 260 ELAPSED 52
lines_per_sec 5.00 LINES 270 ELAPSED 54
lines_per_sec 5.00 LINES 280 ELAPSED 56
lines_per_sec 5.00 LINES 290 ELAPSED 58
lines_per_sec 5.00 LINES 300 ELAPSED 60
lines_per_sec 5.00 LINES 310 ELAPSED 62


---George Jones


On Sun, Apr 19, 2020 at 3:52 PM Koichi Murase 
wrote:

> 2020-04-19 23:54 George Jones :
> > It looks like hash_search just does a linear walk if array entries
> > to find elements in a list.
>
> https://eludom.github.io/blog/20200418/
> > and there it is, the linear search walking the list in hash_search()
> >
> > ```
> > [...]
> >
> >   bucket = HASH_BUCKET (string, table, hv);
> >
> >   for (list = table->bucket_array ? table->bucket_array[bucket] : 0;
> list; list = list->next)
> > {
> > [...]
> > ```
>
> The associative arrays in `hashlib.c' are implemented by hash tables
> as is clear from its name.  The main lookup of hash table algorithm is
> done by the following line
>
>   bucket = HASH_BUCKET (string, table, hv);
>
> but not by the subsequent linear search.  The linear search is just a
> workaround for the collision of hashes.  As far as the load factor of
> the hash table is maintained properly, the linear search is O(1)
> because the length of the list is O(1).
>
> 2020-04-19 23:54 George Jones :
> > This slows down (order N?) new inserts when the number of entries
> > gets large.
>
> I looked into `hashlib.c' and found that rehashing is actually not
> implemented.  I just added a function to perform rehashing, and
> the performance has been improved.
>
> > Would there be any interest in merging a patch to add an option for
> > making this faster (maybe using b-trees?)
>
> https://eludom.github.io/blog/20200418/
> > TODO Look for appropriate in-memory hash insert/lookup functions
> > -  btrees ?
>
> The B-tree is not the most appropriate option here.  The hash table is
> more appropriate.  The B-tree can be used in the case that we want to
> keep the ordering of the keys of associative arrays (e.g. we want to
> enumerate items in ascending/descending order).  Bash associative
> arrays do not ensure the ordering of the items, so the hash table can
> be used as a more efficient choice and is already implemented in
> `hashlib.c'.  We can simply add rehashing.
>
> 
>
> I attach a patch `0001-hashlib-Implement-rehash.patch' which
> implements the rehashing.
>
> I also tested the performance.  The attached `test1.png' shows the
> computational time of insertion of items before and after the fix.
> The lines are fitted by the function `Time = C Size^alpha' where alpha
> is the exponent.  Before the fix, there are two regimes depending on
> the number of items: linear regime O(N) (alpha ~ 1) and quadratic
> regime O(N^2) (alpha ~ 2).  The quadratic regime signals the linear
> scaling of a single insertion.  After the fix, the prformance has been
> improved, and the computational time scales linearly in entire region.
> I also attach a script `test1.sh' which was used to measure the time.
>
> --
> Koichi
>


Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

2020-04-20 Thread George Jones
No problem.   Glad you fixed it.   It's been a LONG time since I've
actually written C, so probably best if someone current did it.

On the parameters, I suggest you consider exposing the at user level as a
switch or env var.   My use case was pathologically large (and would have
been better on, e.g. Spark if that were and option in the environment).
 Even the old behavior was probably good enough for most people.  It's only
when you start abusing bash to to "big data" that the problem shows up.

Thanks again, and impressive quick work !

---george jones



On Mon, Apr 20, 2020, 10:48 AM Koichi Murase  wrote:

> 2020-04-20 23:05 Chet Ramey :
> > On 4/20/20 8:49 AM, Greg Wooledge wrote:
> > > On Mon, Apr 20, 2020 at 06:48:44PM +0900, Koichi Murase wrote:
> > >> Also, I am sorry that I disturbed your plan for contributing to Bash.
> > >> I actually initially doubted that the insertion with the current
> > >> implementation is O(N), so I created the test first and then found
> > >> that it is an easy fix rather than reimplementing it by B-tree or
> > >> other data structures.  I couldn't stop my interest in how much it is
> > >> improved by the easy fix.
> > >
> > > This should in no way make the OP feel that they didn't contribute.
> > > Spotting and diagnosing problems is important work, even if their
> > > proposed patch wasn't selected as the best solution.
> >
> > This is quite true.
>
> Dear All,
>
> Yes, thank you for clarification and sorry for confusing writing.  I
> did not mean George's contribution disappeared, but just I have
> disturbed his five-step plan on his blog where the remaining four
> steps are now canceled.  I have to add that identifying the problem is
> definitely the most non-trivial part because it is relatively
> straightforward to fix the problem once the problem is identified :).
> Also, I would like to thank George for additional testing of my patch.
>
> Of course, this also applies to my patches.  I think I have sent more
> than ten patches so far and many of them have been applied, but I am
> anyway happy if the problems are solved.  You can always reject my
> patches when you have better solutions.
>
> --
> Koichi
>


Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

2020-04-20 Thread George Jones
No real opinion on syntax.

Using something existing:

declare -A foo[SIZE]

seems sensible, especially if there was no semantic meaning (I'm not a fan
of syntax without semantics  clutter).

Big thing is that the new stuff for fringe new pathologic use cases (mine)
should not have negative impact (huge buffer preallocation) on the existing
?30 years? of users/scripts.

Some docs on how the SIZE is used (hint for preallocation of hash table
size, not hard limit on number of entries) probably also in order.

Thanks,
---george jones


On Mon, Apr 20, 2020, 4:23 PM Chet Ramey  wrote:

> On 4/20/20 11:16 AM, George Jones wrote:
> > No problem.   Glad you fixed it.   It's been a LONG time since I've
> > actually written C, so probably best if someone current did it.
> >
> > On the parameters, I suggest you consider exposing the at user level as a
> > switch or env var.   My use case was pathologically large (and would have
> > been better on, e.g. Spark if that were and option in the environment).
> >  Even the old behavior was probably good enough for most people.  It's
> only
> > when you start abusing bash to to "big data" that the problem shows up.
>
> I've been considering how to provide a way to let users indicate the size
> of an array when they declare it. `declare' accepts `declare -a name[size]'
> and `declare -A name[size]' for compatibility with ksh, but has never done
> anything with `size'.
>
> I could see allowing a number, or maybe an arithmetic expression, for SIZE
> to set the initial size of an associative array. Or maybe an additional
> option to `declare' would work better. I don't like using a shell variable
> for this.
>
> What does the group think?
>
> --
> ``The lyf so short, the craft so long to lerne.'' - Chaucer
>  ``Ars longa, vita brevis'' - Hippocrates
> Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/
>