2020-04-19 23:54 George Jones <fooolog...@gmail.com>: > It looks like hash_search just does a linear walk if array entries > to find elements in a list.
https://eludom.github.io/blog/20200418/ > and there it is, the linear search walking the list in hash_search() > > ``` > [...] > > bucket = HASH_BUCKET (string, table, hv); > > for (list = table->bucket_array ? table->bucket_array[bucket] : 0; list; > list = list->next) > { > [...] > ``` The associative arrays in `hashlib.c' are implemented by hash tables as is clear from its name. The main lookup of hash table algorithm is done by the following line bucket = HASH_BUCKET (string, table, hv); but not by the subsequent linear search. The linear search is just a workaround for the collision of hashes. As far as the load factor of the hash table is maintained properly, the linear search is O(1) because the length of the list is O(1). 2020-04-19 23:54 George Jones <fooolog...@gmail.com>: > This slows down (order N?) new inserts when the number of entries > gets large. I looked into `hashlib.c' and found that rehashing is actually not implemented. I just added a function to perform rehashing, and the performance has been improved. > Would there be any interest in merging a patch to add an option for > making this faster (maybe using b-trees?) https://eludom.github.io/blog/20200418/ > TODO Look for appropriate in-memory hash insert/lookup functions > - btrees ? The B-tree is not the most appropriate option here. The hash table is more appropriate. The B-tree can be used in the case that we want to keep the ordering of the keys of associative arrays (e.g. we want to enumerate items in ascending/descending order). Bash associative arrays do not ensure the ordering of the items, so the hash table can be used as a more efficient choice and is already implemented in `hashlib.c'. We can simply add rehashing. ------------------------------------------------------------------------ I attach a patch `0001-hashlib-Implement-rehash.patch' which implements the rehashing. I also tested the performance. The attached `test1.png' shows the computational time of insertion of items before and after the fix. The lines are fitted by the function `Time = C Size^alpha' where alpha is the exponent. Before the fix, there are two regimes depending on the number of items: linear regime O(N) (alpha ~ 1) and quadratic regime O(N^2) (alpha ~ 2). The quadratic regime signals the linear scaling of a single insertion. After the fix, the prformance has been improved, and the computational time scales linearly in entire region. I also attach a script `test1.sh' which was used to measure the time. -- Koichi
From 44690d6cb3eff84b770dee6f5fa590144621891a Mon Sep 17 00:00:00 2001 From: Koichi Murase <myoga.murase@gmail.com> Date: Mon, 20 Apr 2020 03:29:51 +0900 Subject: [PATCH] hashlib: Implement rehash --- hashlib.c | 41 +++++++++++++++++++++++++++++++++++++++++ hashlib.h | 1 + 2 files changed, 42 insertions(+) diff --git a/hashlib.c b/hashlib.c index f8e3b09a..ec3d6010 100644 --- a/hashlib.c +++ b/hashlib.c @@ -39,6 +39,7 @@ #define HASH_BUCKET(s, t, h) (((h) = hash_string (s)) & ((t)->nbuckets - 1)) static BUCKET_CONTENTS *copy_bucket_array __P((BUCKET_CONTENTS *, sh_string_func_t *)); +static void hash_rehash __P((HASH_TABLE *)); /* Make a new hash table with BUCKETS number of buckets. Initialize each slot in the table to NULL. */ @@ -105,6 +106,40 @@ copy_bucket_array (ba, cpdata) return new_bucket; } +static void +hash_rehash (table) + HASH_TABLE *table; +{ + int old_nbuckets, i, j; + BUCKET_CONTENTS **old_bucket_array, *item, *next; + unsigned int hv; + + if (table == NULL || table->nentries > INT_MAX / HASH_REHASH_FACTOR) return; + + old_nbuckets = table->nbuckets; + old_bucket_array = table->bucket_array; + + table->nbuckets = table->nentries * HASH_REHASH_FACTOR; + table->bucket_array = + (BUCKET_CONTENTS **)xmalloc (table->nbuckets * sizeof (BUCKET_CONTENTS *)); + for (i = 0; i < table->nbuckets; i++) + table->bucket_array[i] = (BUCKET_CONTENTS *)NULL; + + for (j = 0; j < old_nbuckets; j++) + { + for (item = old_bucket_array[j]; item; item = next) + { + next = item->next; + i = HASH_BUCKET (item->key, table, hv); + item->khash = hv; + item->next = table->bucket_array[i]; + table->bucket_array[i] = item; + } + } + + free (old_bucket_array); +} + HASH_TABLE * hash_copy (table, cpdata) HASH_TABLE *table; @@ -198,6 +233,9 @@ hash_search (string, table, flags) if (flags & HASH_CREATE) { + if (table->nentries >= table->nbuckets) + hash_rehash (table); + list = (BUCKET_CONTENTS *)xmalloc (sizeof (BUCKET_CONTENTS)); list->next = table->bucket_array[bucket]; table->bucket_array[bucket] = list; @@ -269,6 +307,9 @@ hash_insert (string, table, flags) if (item == 0) { + if (table->nentries >= table->nbuckets) + hash_rehash (table); + bucket = HASH_BUCKET (string, table, hv); item = (BUCKET_CONTENTS *)xmalloc (sizeof (BUCKET_CONTENTS)); diff --git a/hashlib.h b/hashlib.h index 88ea778f..f8463e9f 100644 --- a/hashlib.h +++ b/hashlib.h @@ -74,6 +74,7 @@ extern unsigned int hash_string __P((const char *)); /* Default number of buckets in the hash table. */ #define DEFAULT_HASH_BUCKETS 128 /* must be power of two */ +#define HASH_REHASH_FACTOR 4 #define HASH_ENTRIES(ht) ((ht) ? (ht)->nentries : 0) -- 2.21.1
test1.sh
Description: Bourne shell script