(defn radix-tree-set [things]
  ;; things supplies the alphabet of elements the stored sequences use.
  (let [sentinel (Object.)                  ; unique end-of-sequence marker
        vs (cons sentinel things)
        nt (vec (repeat (count vs) nil))]   ; the empty-node template
    [(zipmap vs (iterate inc 0)) nt sentinel nt]))

(defn radix-tree-set-contains? [[m nt sentinel t] s]
  (get-in t (map m (concat s [sentinel]))))

(defn radix-tree-set-conj* [m nt sentinel n s]
  (let [k (m (if s (first s) sentinel))
        c (or (n k) nt)]                    ; existing child, or a fresh node
    (assoc n k
           (if s
             (radix-tree-set-conj* m nt sentinel c (next s))
             true))))

(defn radix-tree-set-conj [[m nt sentinel t] s]
  [m nt sentinel (radix-tree-set-conj* m nt sentinel t (seq s))])
user=> (def foo (reduce radix-tree-set-conj (radix-tree-set "abc")
["abc" "ab" "bca" "bcac" "bbc" "aaa" "acb" "acc"]))
#'user/foo
user=> (map #(radix-tree-set-contains? foo %) ["abc" "bbc" "bac" "acb"
"ab" "bc" "bcac" ""])
(true true nil true true nil true nil)
This allows testing strings or other sequences for set membership, and
"inserting" them, in time proportional to the sequence's length rather
than to the number of stored sequences. Furthermore, storage is
potentially more efficient than a standard hash-set or sorted-set when
there are long prefixes that are each shared by many of the strings.
Now I'm also curious if there's a way to more efficiently store sets
of hashes. Hashes turn into fairly random-looking strings,
byte-arrays, or whatever format, which may perform as poorly,
storage-wise, in a radix-tree-set as in a normal tree-set or a
hash-set.
Whereas a set of strings is likely to have long common prefixes (e.g.
all the "Wesson, "-starting strings in a large collection of names
being tested for "Wesson, Ken" being a member), so its entropy is
found mainly at the tail, a sparse set of hashes has its entropy
mainly at the head: after the first few bytes there are almost always
few or no hashes with that prefix. On the other hand, the space of
prefixes of length k for some k < the hashes' length n will be close
to saturated.
The tree structure performs poorly on the dense prefix space,
storage-wise. However, a tree or even a linear list may perform fine
on the suffix space. This suggests:
1. Find a k such that for your expected number of objects, the prefix
space of length k is fairly densely filled, but for a given prefix
of length k there are only a handful of hashes.
2. Create an array of the entire prefix space. If we're treating
hashes as byte arrays, that's a vector of length 256^k.
3. Store not the whole hashes but the suffixes: each array cell
contains a seq of suffixes.
Let's suppose we wanted to store ten million SHA1 hashes. Those are
160-bit, or 20-byte, hashes. The physical storage of the hashes
themselves as byte arrays is 200 million bytes in the actual hashes,
plus several tens of millions more bytes in object headers and Java
array count fields. We're talking a respectable fraction of a gigabyte
here.
k = 2 saves 20 megabytes of actual hash storage and requires a prefix
array of only 64K elements, about a quarter megabyte if those elements
are four-byte machine pointers -- there's some small savings.
k = 3 already looks less rosy: saves 30 megabytes, costs 64 megabytes
to store the prefix array.
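These tradeoffs are easy to tabulate at the REPL; the sketch below just
restates the arithmetic above (ten million hashes, four-byte pointers):

```clojure
;; Bytes saved by truncating k leading bytes off ten million hashes,
;; versus the cost of a 256^k-slot prefix array of 4-byte pointers.
(defn prefix-tradeoff [k]
  (let [n 10000000
        saved (* n k)
        array-bytes (* 4 (long (Math/pow 256 k)))]
    {:k k :bytes-saved saved :array-bytes array-bytes}))

(map prefix-tradeoff [2 3])
;; => ({:k 2, :bytes-saved 20000000, :array-bytes 262144}
;;     {:k 3, :bytes-saved 30000000, :array-bytes 67108864})
```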
Using a radix tree to store the suffixes for each prefix won't help
much: the suffixes are going to be pretty random.
On the other hand, assuming we're accepting a probability of hash
collisions, just how high can this probability be allowed to get?
Suppose again we want to store ten million hashes. Furthermore, we
want to be able to test ten *trillion* things for hashing against this
set. Assume no adversary will try to deliberately engineer hash
collisions. How far can we truncate the hashes without the odds of a
collision reaching 1%?
With SHA-1's 160 bits, the size of the hash space is 2^160 ~= ... (/
160 (/ (Math/log 10) (Math/log 2))) ... carry the 2 ... 10^48 or so.
With ten trillion objects to test, the odds of a collision rise by a
factor of 10^13, from one in 10^48 to one in 10^35. That's still
astronomically tiny. The odds reach 1% if we shrink the hash space so
that this becomes one in 10^2, i.e. a hash space of 10^15. (Strictly,
each test object can collide with any of the ten million stored
hashes, which multiplies the expected number of collisions by another
10^7, so 10^15 is an optimistic lower bound; that's tolerable only
because the uses below already allow for some false positives.) That's
only ONE-THIRD of the logarithm of the size of the SHA-1 hash space,
which means we can chuck the final 13 bytes of the SHA-1 and not have
to worry too much about false positives.
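The logarithm juggling can be checked directly; converting bits to
decimal digits is just multiplication by log10(2):

```clojure
;; Number of decimal digits spanned by a b-bit hash space.
(defn decimal-digits [bits]
  (/ bits (/ (Math/log 10) (Math/log 2))))

(decimal-digits 160) ;; => ~48.2, so 2^160 ~= 10^48
(decimal-digits 56)  ;; => ~16.9, so 7 bytes still exceeds the 10^15 target
```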
That leaves 7 bytes. Direct storage of this still takes ~200 megabytes
after accounting for object headers and count fields. But maybe it can
be shrunk using a prefix array.
Assume k=2. What is the most efficient storage of the 5-byte suffixes?
Individual byte arrays are right out; the overhead per array exceeds
the data! A radix tree won't be any better.
How many suffixes can be expected per prefix? Ten million is roughly
152 times larger than the size of the prefix space: enough larger that
the law of large numbers kicks in and the actual number of suffixes
for a particular prefix should be fairly close to 152, given a good
hash function.
152 is low enough to suggest that we may be able to get away with a
linear search, in which case just cramming the suffixes together
works. A Java byte array interpreted as a variable-length block of
fixed-length 5-byte records will cost 776 bytes if the object header
is 8 bytes, the array count field is another four, the pointer to it
in the prefix array is another four, and there are 152 records. In
practice, the array will have to be longer, to accommodate outliers or
because it starts at one record and is doubled in length each time it
must grow to accommodate a record. In that case, the typical array will
wind up sized to accommodate 256 records, for 1296 bytes, and a few
may even end up weighing in at 2576 bytes, by the time ten million
hashes have been added. Call it 1500 bytes per array, on average. Then
we're looking at 65536 * 1500 ~= 100 megabytes of storage.
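The bucket arithmetic above can be sanity-checked like so (8-byte
object header + 4-byte count field + 4-byte pointer = 16 bytes of
overhead per suffix array):

```clojure
(def per-bucket-overhead (+ 8 4 4)) ; header + array count + prefix-array pointer

(defn bucket-bytes [records]
  (+ per-bucket-overhead (* 5 records)))

(quot 10000000 65536)             ;; => 152, average suffixes per prefix
(bucket-bytes 152)                ;; => 776
(bucket-bytes 256)                ;; => 1296
(bucket-bytes 512)                ;; => 2576
(/ (* 65536 1500) 1024.0 1024.0) ;; => 93.75, i.e. ~100 megabytes
```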
So, storing the SHA-1s would take a third of a gig; storing the 7-byte
mini-hashes in a hash-set or something would still chew up 200 megs;
storing them as linear-searched suffix arrays in a prefix array cuts
it down to 100 and also admits of efficient serialization -- 65536
counted records with 32-bit counts will take up 1/4 megabyte (exactly)
on counts, plus 50 megabytes in suffixes, equals about 50 megabytes.
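Such a serializer might be sketched as follows, assuming the set is a
65536-slot array of byte arrays as described; the function name and
the use of DataOutputStream are my choices:

```clojure
(import '(java.io ByteArrayOutputStream DataOutputStream))

;; Sketch: emit 65536 counted records -- a 32-bit suffix count followed
;; by that bucket's raw 5-byte suffixes (an empty bucket is count 0).
(defn serialize-suffixes ^bytes [^objects prefix-array]
  (let [baos (ByteArrayOutputStream.)
        out (DataOutputStream. baos)]
    (dotimes [i (alength prefix-array)]
      (if-let [^bytes ba (aget prefix-array i)]
        (do (.writeInt out (quot (alength ba) 5))
            (.write out ba 0 (alength ba)))
        (.writeInt out 0)))
    (.toByteArray baos)))
```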
The in-memory usage is similar if we don't use doubling arrays and
instead copy the array on every write. That's also going to be slower,
but modern machines copy 1 KB arrays fairly efficiently.
Here is code for a hash7-array-set structure that does so:
(defn hash7-array-set []
  (make-array (type (byte-array 0)) 65536))

(defn hash7-array-set-contains? [^objects h7a ^bytes h]
  ;; Java bytes are signed, so mask each to 0-255 when building the
  ;; prefix index; otherwise high bytes would yield negative indices.
  (if-let [^bytes ba (aget h7a (+ (* (bit-and (aget h 0) 0xff) 256)
                                  (bit-and (aget h 1) 0xff)))]
    (let [l (alength ba)]
      (loop [i 0]                 ; linear scan over the 5-byte records
        (if (loop [i i j 2]       ; does the record at i match h[2..6]?
              (if (== (aget ba i) (aget h j))
                (if (not= j 6)
                  (recur (inc i) (inc j))
                  true)))
          true
          (let [i (+ i 5)]        ; mismatch: skip to the next record
            (if (< i l)
              (recur i))))))))

(defn hash7-array-set-put! [^objects h7a ^bytes h]
  (let [pfx (+ (* (bit-and (aget h 0) 0xff) 256) ; same signed-byte masking
               (bit-and (aget h 1) 0xff))]
    (aset h7a pfx
          (if-let [^bytes ba (aget h7a pfx)]
            (let [l (alength ba)
                  ^bytes ba2 (java.util.Arrays/copyOf ba (+ l 5))]
              (loop [i l j 2]     ; append h[2..6] after the old records
                (aset ba2 i (aget h j))
                (if (not= j 6)
                  (recur (inc i) (inc j))))
              ba2)
            (java.util.Arrays/copyOfRange h 2 7)))
    h7a))
user=> (def foo (hash7-array-set-put! (hash7-array-set) (byte-array
(map byte [0 0 1 2 3 4 5]))))
#'user/foo
user=> (seq (aget (hash7-array-set-put! foo (byte-array (map byte [0 0
6 7 8 9 10]))) 0))
(1 2 3 4 5 6 7 8 9 10)
user=> (hash7-array-set-contains? foo (byte-array (map byte [0 0 6 7 8 9 9])))
nil
user=> (hash7-array-set-contains? foo (byte-array (map byte [0 0 6 7 8 9 10])))
true
user=> (hash7-array-set-contains? foo (byte-array (map byte [0 0 1 2 3 4 5])))
true
user=> (hash7-array-set-contains? foo (byte-array (map byte [0 0 1 2 3 4 7])))
nil
user=> (hash7-array-set-contains? foo (byte-array (map byte [0 0 2 3 4 5 6])))
nil
This is NOT a persistent, immutable structure, unlike radix-tree-set
-- it can't be without doubling or even tripling its memory use. So it
will need to be guarded against concurrent access. I'd suggest
(locking foo ...), or else a ref, dosync, and (ensure foo), around any
call to hash7-array-set-contains? to avoid race conditions during the
contains? test.
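For instance, a generic wrapper for the locking approach could look
like this (guarded is my name; the lock is taken on the array object
itself):

```clojure
;; Sketch: run any operation on the structure while holding a lock on
;; the array itself, so a put! can't resize a bucket mid-contains?.
(defn guarded [op]
  (fn [h7a & args]
    (locking h7a
      (apply op h7a args))))

;; usage: (def safe-put!      (guarded hash7-array-set-put!))
;;        (def safe-contains? (guarded hash7-array-set-contains?))
```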
Note also that it only accepts byte arrays for putting and testing,
and ignores anything past the seventh byte. It may throw an exception
if the byte array is shorter than 7, but won't always:
user=> (hash7-array-set-contains? foo (byte-array (map byte [0 0 2])))
nil
user=> (hash7-array-set-contains? foo (byte-array (map byte [0 0 1])))
#<CompilerException java.lang.ArrayIndexOutOfBoundsException: 3
(NO_SOURCE_FILE:625)>
A hash produced via MessageDigest is a byte array; a Java .hashCode is
not, but can be coerced into one with:
(defn ba-hash [obj]
  (let [h (hash obj)]
    (byte-array
     (map (comp byte #(if (> % 127) (- % 256) %)) ; fold 128-255 into the signed byte range
          [(bit-shift-right (bit-and h 0xff000000) 24)
           (bit-shift-right (bit-and h 0xff0000) 16)
           (bit-shift-right (bit-and h 0xff00) 8)
           (bit-and h 0xff)
           0 0 0]))))       ; pad to the 7 bytes hash7-array-set expects
user=> (def bar (Object.))
#'user/bar
user=> bar
#<Object java.lang.Object@a70acd>
user=> (map (partial format "%x") (ba-hash bar))
("0" "a7" "a" "cd" "0" "0" "0")
user=>
Obviously, though, if you will ONLY be storing Java hashes in a
particular hash-array-set, you'll want to make a hash4-array-set that
will be much more efficient. That will chew up only about 20 megabytes
instead of 50 for an eventual size of ten million hashes. But then, if
the Java objects are going to be live for the duration, you might as
well just use a normal hash-set. The only reason you might want to put
Java hashes in one of these things is if there will be far more
objects' hashes in it than live objects. Such a scenario *could*
occur, if objects are persisted or the same ones keep being recreated
AND their hashes are computed from their contents -- strings again are
a plausible case here.
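A put! for such a hash4 variant might look like this (a sketch
mirroring hash7-array-set-put!: the top 16 bits of the int hash pick
the bucket, the low 16 bits become a 2-byte suffix; the names are
mine, and a contains? would follow the hash7 version analogously):

```clojure
(defn hash4-array-set []
  (make-array (type (byte-array 0)) 65536))

(defn hash4-array-set-put! [^objects h4a hashcode]
  (let [pfx (bit-and (bit-shift-right hashcode 16) 0xffff) ; top 16 bits
        lo (bit-and hashcode 0xffff)                       ; low 16 bits
        rec (byte-array [(unchecked-byte (bit-shift-right lo 8))
                         (unchecked-byte lo)])]
    (aset h4a pfx
          (if-let [^bytes ba (aget h4a pfx)]
            (let [l (alength ba)
                  ^bytes ba2 (java.util.Arrays/copyOf ba (+ l 2))]
              (System/arraycopy rec 0 ba2 l 2) ; append the new suffix
              ba2)
            rec))
    h4a))
```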
And, of course, false positives are possible (though not false
negatives), so I can think of two broad usage patterns:
1. An in-memory table used for a cheap weeding out of most negatives,
after which an expensive test, such as a byte-by-byte file
comparison, is used to check anything that gets past this.
2. Something that just has to be "good enough for government work",
like a spyware filter.
Spyware is an interesting case, because whereas viruses hide in
infected executable files that all differ, and spam tends to
contain randomized portions to defeat filters, spyware is just
software; it gets installed somewhere and doesn't tend to vary
much. So checking executables against a hash database of known
spyware executables using this thing would give low rates of false
positives and false negatives.
False negatives would all result from a new irritant not having
been added to the database yet, so this won't be any worse than
other potential antispyware methods, and false positives would
just result in a warning to the user that they could check up on
to see whether the threat was real. If the tests were confined to
BHOs and items in autoexec/startup registry keys, false positives
would be very rare.
A network-aware antispyware app could follow up on an initial
positive with a network download of the actual spyware executable
for byte-by-byte comparison with the suspect file, eliminating
false positives entirely -- and this type of app normally is
network-aware so as to autoupdate and keep the false negatives to
a minimum.
(Note that a 50 megabyte hash database is a hefty download, so
you'd want updates to normally take the form of lists of new
hashes rather than the whole DB. The app could download the whole
DB on first use, then catch up via updates with newer modification
dates; the server would host the current DB and all of the updates
for the past year, or some such. If the DB got more than that far
out of date the app would re-download the whole DB.)
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]