2020-04-19 23:54 George Jones <fooolog...@gmail.com>:
> It looks like hash_search just does a linear walk of array entries
> to find elements in a list.

https://eludom.github.io/blog/20200418/
> and there it is, the linear search walking the list in hash_search()
>
> ```
> [...]
>
>   bucket = HASH_BUCKET (string, table, hv);
>
>   for (list = table->bucket_array ? table->bucket_array[bucket] : 0; list; 
> list = list->next)
>     {
> [...]
> ```

The associative arrays are implemented by hash tables in `hashlib.c',
as the file name suggests.  The main lookup step of the hash-table
algorithm is the following line

  bucket = HASH_BUCKET (string, table, hv);

and not the subsequent linear search.  The linear search merely
resolves hash collisions.  As long as the load factor of the hash
table is kept bounded, the linear search is O(1) on average because
the expected length of each bucket's list is O(1).

2020-04-19 23:54 George Jones <fooolog...@gmail.com>:
> This slows down (order N?) new inserts when the number of entries
> gets large.

I looked into `hashlib.c' and found that rehashing is actually not
implemented.  I added a function that performs rehashing, and the
performance improved.

> Would there be any interest in merging a patch to add an option for
> making this faster (maybe using b-trees?)

https://eludom.github.io/blog/20200418/
> TODO Look for appropriate in-memory hash insert/lookup functions
> -  btrees ?

A B-tree is not the most appropriate option here; a hash table is.  A
B-tree would be useful if we wanted to preserve the ordering of the
keys of associative arrays (e.g., to enumerate items in
ascending/descending order).  Bash associative arrays do not guarantee
any ordering of the items, so a hash table is the more efficient
choice, and one is already implemented in `hashlib.c'.  We can simply
add rehashing.

------------------------------------------------------------------------

I attach a patch `0001-hashlib-Implement-rehash.patch' which
implements the rehashing.

I also tested the performance.  The attached `test1.png' shows the
computational time of inserting items before and after the fix.  The
lines are fitted by the function `Time = C Size^alpha', where alpha is
the scaling exponent.  Before the fix, there are two regimes depending
on the number of items: a linear regime O(N) (alpha ~ 1) and a
quadratic regime O(N^2) (alpha ~ 2).  The quadratic regime signals the
linear scaling of a single insertion.  After the fix, the performance
is improved, and the computational time scales linearly over the
entire range.  I also attach the script `test1.sh' that was used to
measure the time.

--
Koichi
From 44690d6cb3eff84b770dee6f5fa590144621891a Mon Sep 17 00:00:00 2001
From: Koichi Murase <myoga.murase@gmail.com>
Date: Mon, 20 Apr 2020 03:29:51 +0900
Subject: [PATCH] hashlib: Implement rehash

---
 hashlib.c | 41 +++++++++++++++++++++++++++++++++++++++++
 hashlib.h |  1 +
 2 files changed, 42 insertions(+)

diff --git a/hashlib.c b/hashlib.c
index f8e3b09a..ec3d6010 100644
--- a/hashlib.c
+++ b/hashlib.c
@@ -39,6 +39,7 @@
 #define HASH_BUCKET(s, t, h) (((h) = hash_string (s)) & ((t)->nbuckets - 1))
 
 static BUCKET_CONTENTS *copy_bucket_array __P((BUCKET_CONTENTS *, sh_string_func_t *));
+static void hash_rehash __P((HASH_TABLE *));
 
 /* Make a new hash table with BUCKETS number of buckets.  Initialize
    each slot in the table to NULL. */
@@ -105,6 +106,40 @@ copy_bucket_array (ba, cpdata)
   return new_bucket;  
 }
 
+static void
+hash_rehash (table)
+     HASH_TABLE *table;
+{
+  int old_nbuckets, i, j;
+  BUCKET_CONTENTS **old_bucket_array, *item, *next;
+  unsigned int hv;
+
+  if (table == NULL || table->nentries > INT_MAX / HASH_REHASH_FACTOR) return;
+
+  old_nbuckets = table->nbuckets;
+  old_bucket_array = table->bucket_array;
+
+  table->nbuckets = table->nentries * HASH_REHASH_FACTOR;
+  table->bucket_array =
+    (BUCKET_CONTENTS **)xmalloc (table->nbuckets * sizeof (BUCKET_CONTENTS *));
+  for (i = 0; i < table->nbuckets; i++)
+    table->bucket_array[i] = (BUCKET_CONTENTS *)NULL;
+
+  for (j = 0; j < old_nbuckets; j++)
+    {
+      for (item = old_bucket_array[j]; item; item = next)
+	{
+	  next = item->next;
+	  i = HASH_BUCKET (item->key, table, hv);
+	  item->khash = hv;
+	  item->next = table->bucket_array[i];
+	  table->bucket_array[i] = item;
+	}
+    }
+
+  free (old_bucket_array);
+}
+
 HASH_TABLE *
 hash_copy (table, cpdata)
      HASH_TABLE *table;
@@ -198,6 +233,9 @@ hash_search (string, table, flags)
 
   if (flags & HASH_CREATE)
     {
+      if (table->nentries >= table->nbuckets)
+	hash_rehash (table);
+
       list = (BUCKET_CONTENTS *)xmalloc (sizeof (BUCKET_CONTENTS));
       list->next = table->bucket_array[bucket];
       table->bucket_array[bucket] = list;
@@ -269,6 +307,9 @@ hash_insert (string, table, flags)
 
   if (item == 0)
     {
+      if (table->nentries >= table->nbuckets)
+	hash_rehash (table);
+
       bucket = HASH_BUCKET (string, table, hv);
 
       item = (BUCKET_CONTENTS *)xmalloc (sizeof (BUCKET_CONTENTS));
diff --git a/hashlib.h b/hashlib.h
index 88ea778f..f8463e9f 100644
--- a/hashlib.h
+++ b/hashlib.h
@@ -74,6 +74,7 @@ extern unsigned int hash_string __P((const char *));
 
 /* Default number of buckets in the hash table. */
 #define DEFAULT_HASH_BUCKETS 128	/* must be power of two */
+#define HASH_REHASH_FACTOR 4
 
 #define HASH_ENTRIES(ht)	((ht) ? (ht)->nentries : 0)
 
-- 
2.21.1

Attachment: test1.sh
Description: Bourne shell script
