speeding up hash_search?
It looks like hash_search just does a linear walk of array entries to find elements in a list. This slows down new inserts (order N?) when the number of entries gets large. Would there be any interest in merging a patch to add an option to make this faster (maybe using B-trees)? My analysis is here: https://eludom.github.io/blog/20200418/

Thanks,
---george jones
Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)
Thank you. Patch applied and (performance) tested with some tests I was working on: https://github.com/eludom/snippits/tree/master/bash/tests

Bottom line:

Before:

...
lines_per_sec 993.37 LINES 270 ELAPSED 2718
lines_per_sec 955.95 LINES 280 ELAPSED 2929
lines_per_sec 921.51 LINES 290 ELAPSED 3147
lines_per_sec 889.41 LINES 300 ELAPSED 3373
lines_per_sec 859.43 LINES 310 ELAPSED 3607

After:

...
lines_per_sec 5.00 LINES 260 ELAPSED 52
lines_per_sec 5.00 LINES 270 ELAPSED 54
lines_per_sec 5.00 LINES 280 ELAPSED 56
lines_per_sec 5.00 LINES 290 ELAPSED 58
lines_per_sec 5.00 LINES 300 ELAPSED 60
lines_per_sec 5.00 LINES 310 ELAPSED 62

---George Jones

On Sun, Apr 19, 2020 at 3:52 PM Koichi Murase wrote:
> 2020-04-19 23:54 George Jones :
> > It looks like hash_search just does a linear walk of array entries
> > to find elements in a list.
> > https://eludom.github.io/blog/20200418/
> > and there it is, the linear search walking the list in hash_search()
> >
> > ```
> > [...]
> >
> > bucket = HASH_BUCKET (string, table, hv);
> >
> > for (list = table->bucket_array ? table->bucket_array[bucket] : 0;
> >      list; list = list->next)
> >   {
> > [...]
> > ```
>
> The associative arrays in `hashlib.c' are implemented with hash
> tables, as is clear from its name. The main lookup of the hash table
> algorithm is done by the following line
>
>     bucket = HASH_BUCKET (string, table, hv);
>
> and not by the subsequent linear search. The linear search is just a
> workaround for hash collisions. As long as the load factor of the
> hash table is maintained properly, the linear search is O(1) because
> the length of the list is O(1).
>
> 2020-04-19 23:54 George Jones :
> > This slows down (order N?) new inserts when the number of entries
> > gets large.
>
> I looked into `hashlib.c' and found that rehashing is actually not
> implemented. I just added a function to perform rehashing, and the
> performance has been improved.
> > > Would there be any interest in merging a patch to add an option for
> > > making this faster (maybe using b-trees?)
> > https://eludom.github.io/blog/20200418/
> > TODO Look for appropriate in-memory hash insert/lookup functions
> > - btrees ?
>
> The B-tree is not the most appropriate option here; the hash table is
> more appropriate. A B-tree would be the right choice if we wanted to
> keep the keys of associative arrays ordered (e.g. to enumerate items
> in ascending/descending order). Bash associative arrays do not
> guarantee the ordering of items, so a hash table is the more
> efficient choice and is already implemented in `hashlib.c'. We can
> simply add rehashing.
>
> I attach a patch `0001-hashlib-Implement-rehash.patch' which
> implements the rehashing.
>
> I also tested the performance. The attached `test1.png' shows the
> computational time of insertion of items before and after the fix.
> The lines are fitted by the function `Time = C Size^alpha', where
> alpha is the exponent. Before the fix, there are two regimes
> depending on the number of items: a linear regime O(N) (alpha ~ 1)
> and a quadratic regime O(N^2) (alpha ~ 2). The quadratic regime
> signals the linear scaling of a single insertion. After the fix, the
> performance has improved, and the computational time scales linearly
> over the entire region. I also attach a script `test1.sh' which was
> used to measure the time.
>
> --
> Koichi
Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)
No problem. Glad you fixed it. It's been a LONG time since I've actually written C, so it's probably best if someone current did it.

On the parameters, I suggest you consider exposing them at the user level as a switch or env var. My use case was pathologically large (and would have been better on, e.g., Spark if that were an option in the environment). Even the old behavior was probably good enough for most people. It's only when you start abusing bash to do "big data" that the problem shows up.

Thanks again, and impressively quick work!

---george jones

On Mon, Apr 20, 2020, 10:48 AM Koichi Murase wrote:
> 2020-04-20 23:05 Chet Ramey :
> > On 4/20/20 8:49 AM, Greg Wooledge wrote:
> > > On Mon, Apr 20, 2020 at 06:48:44PM +0900, Koichi Murase wrote:
> > >> Also, I am sorry that I disturbed your plan for contributing to
> > >> Bash. I actually initially doubted that insertion with the current
> > >> implementation is O(N), so I created the test first and then found
> > >> that it is an easy fix rather than reimplementing it with a B-tree
> > >> or other data structures. I couldn't stop my interest in how much
> > >> it would be improved by the easy fix.
> > >
> > > This should in no way make the OP feel that they didn't contribute.
> > > Spotting and diagnosing problems is important work, even if their
> > > proposed patch wasn't selected as the best solution.
> >
> > This is quite true.
>
> Dear All,
>
> Yes, thank you for the clarification, and sorry for the confusing
> writing. I did not mean that George's contribution disappeared, only
> that I have disturbed the five-step plan on his blog, where the
> remaining four steps are now canceled. I have to add that identifying
> the problem is definitely the most non-trivial part, because it is
> relatively straightforward to fix a problem once it has been
> identified :). Also, I would like to thank George for the additional
> testing of my patch.
>
> Of course, this also applies to my patches.
> I think I have sent more than ten patches so far, and many of them
> have been applied, but in any case I am happy if the problems are
> solved. You can always reject my patches when you have better
> solutions.
>
> --
> Koichi
Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)
No real opinion on syntax. Using something existing:

    declare -A foo[SIZE]

seems sensible, especially since SIZE previously had no semantic meaning (I'm not a fan of syntax-without-semantics clutter).

The big thing is that new machinery for fringe, pathological use cases (mine) should not have a negative impact (e.g. huge buffer preallocation) on the existing ~30 years of users/scripts. Some docs on how SIZE is used (a hint for preallocating the hash table, not a hard limit on the number of entries) are probably also in order.

Thanks,
---george jones

On Mon, Apr 20, 2020, 4:23 PM Chet Ramey wrote:
> On 4/20/20 11:16 AM, George Jones wrote:
> > No problem. Glad you fixed it. It's been a LONG time since I've
> > actually written C, so it's probably best if someone current did it.
> >
> > On the parameters, I suggest you consider exposing them at the user
> > level as a switch or env var. My use case was pathologically large
> > (and would have been better on, e.g., Spark if that were an option
> > in the environment). Even the old behavior was probably good enough
> > for most people. It's only when you start abusing bash to do "big
> > data" that the problem shows up.
>
> I've been considering how to provide a way to let users indicate the
> size of an array when they declare it. `declare' accepts
> `declare -a name[size]' and `declare -A name[size]' for compatibility
> with ksh, but has never done anything with `size'.
>
> I could see allowing a number, or maybe an arithmetic expression, for
> SIZE to set the initial size of an associative array. Or maybe an
> additional option to `declare' would work better. I don't like using
> a shell variable for this.
>
> What does the group think?
>
> --
> ``The lyf so short, the craft so long to lerne.'' - Chaucer
> ``Ars longa, vita brevis'' - Hippocrates
> Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/