On Mon, 2006-03-13 at 22:10 -0500, jamal wrote:
> I will try to find the scripts. 

Just the data sets will do.  I can then pump it through my
python script.  But before launching into this, I don't
see the choice of algorithm as very important.  Yes, I
believe the 2.4 one was better.  But not by so much that
it greatly alters the effectiveness of u32.

> The end goal is to reduce the total lookups at the expense of more RAM.
> I dont see using 20M as a big deal for example.

Sigh.  No it's not .. now.

> Imagine a few thousand hosts... and you want to distribute 
> these across many hash tables (i recall the test i had was for 64K
> unique hosts).
> One of the techniques you could use is to have several layers of hash
> tables before eventually hitting the terminal; at each  then you could
> spread the bits (of lets say the 3rd octet of src IP in 3 bit
> hash masks) in the same hash level at the diagonal level. 
> The result is many hash tables created (a lot of memory consumed) but
> faster lookups.
> In such a case a selection on a hash table could be made with bit
> pattern 101 or 111 etc as the mask with a hash table with only 8
> buckets.
> 
> I am not sure if that made sense.

If you use a tree of hash tables, where each hash table is
indexed off one byte - as I think you are suggesting, then
the two algorithms are very similar.  Identical, in fact,
unless the "byte" you are indexing off is split across
a natural 8 bit boundary.  If that is the case the two
algorithms will put things in different buckets, but there
will still be only one filter element per bucket.

So the only case worth comparing is where you are hashing
on more than 8 bits.  If the code space is dense (by that
I mean if you have N bits then the number of items, E say,
you have to distinguish is > 2^(N-1)), then you are
better off using a tree like you said.  Effectively you
aren't hashing - you are indexing into an array (if that 
makes any sense).
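A toy sketch of that point (not kernel code, just an illustration): when the key space is dense, hashing N bits into 2^N buckets degenerates into plain array indexing, with at most one item per bucket.

```python
# Dense case: nearly every N-bit value occurs, so an identity "hash"
# into 2^N buckets is really just indexing into an array.
N = 8
buckets = [[] for _ in range(1 << N)]

keys = range(200)                    # E = 200 items in a 2^8 = 256 slot space
for k in keys:
    buckets[k & ((1 << N) - 1)].append(k)

# Every used bucket holds exactly one item - no chains to walk.
max_load = max(len(b) for b in buckets)
print(max_load)                      # -> 1
```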

So it is only the non-dense cases, where E <<< 2^N where 
the hashing algorithm becomes important.  In my case, I 
was indexing off the tcp/udp port.  E==400 odd, and N==16.  
This is a very sparse use of the code space, and so the 
hashing mattered.  I was so concerned that I would end up 
with 50 or 60 items in one bucket, I actually went to the
trouble of plotting it using a python program.  I was
amazed at how well the original 2.4 hash worked.  I was
then stunned that the resulting u32 filter didn't work - 
because of the bugs in the sample.  (This is how I ended up
here - posting patches to netdev.)  When I checked the
results using the 2.6 hashing function they were different,
and worse, but no 50 item bucket disasters.

This is all navel gazing of course - based on one example.
Yours may reveal a different story.
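To make the sparse case concrete (these are stand-in folds for illustration, not the actual 2.4 or 2.6 u32 hash functions): fold ~400 random 16-bit "ports" into 256 buckets two different ways and look at the worst bucket.

```python
import random

# Sparse case: E = 400 items scattered over an N = 16 bit code space,
# folded into 256 buckets.  Which fold you choose now matters.
random.seed(0)
ports = random.sample(range(1 << 16), 400)

low_byte = [0] * 256                 # fold 1: keep only the low 8 bits
xor_fold = [0] * 256                 # fold 2: XOR the high byte into the low byte
for p in ports:
    low_byte[p & 0xff] += 1
    xor_fold[(p ^ (p >> 8)) & 0xff] += 1

# Compare the worst-case bucket occupancy of the two folds.
print(max(low_byte), max(xor_fold))
```

With uniformly random input both folds do fine; it is the structure in real port numbers (clustering in narrow ranges) that separates a good fold from a bad one.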

> The equation you provided for evaluating the different hashes does look
> clever but i have never seen it used anywhere. What i did was to compute
> the std deviation from the best value.

Pulled it out of my arse :(.  However, it is a standard
deviation of sorts.  Plot the buckets on a graph.  The
bucket number is the X axis.  The number of items in the 
bucket is the Y axis.  Since every bucket should have the 
same number of items, the resulting graph for a perfect
hash should be a straight line, parallel to the X axis.  
Perhaps not exact, because you can't have 1/2 an item in 
the bucket, so you may get a 1 item difference between
some buckets.

The figure I calculated is the standard deviation of the
actual graph from this perfect one.
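A minimal rendering of that figure as code (my reading of the description above; the function name is made up): the RMS deviation of the actual bucket counts from the perfect flat line at E/B.

```python
import math

def hash_quality(counts):
    """RMS deviation of bucket counts from the ideal flat distribution."""
    e = sum(counts)                  # E: total number of items
    b = len(counts)                  # B: number of buckets
    perfect = e / b                  # height of the perfect straight line
    return math.sqrt(sum((c - perfect) ** 2 for c in counts) / b)

print(hash_quality([2, 2, 2, 2]))   # perfectly flat line -> 0.0
print(hash_quality([8, 0, 0, 0]))   # everything piled into one bucket
```

Lower is better; zero means every bucket sits exactly on the perfect line.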

