On Mon, 2006-03-13 at 22:10 -0500, jamal wrote:
> I will try to find the scripts.
Just the data sets will do. I can then pump them through my python script. But before launching into this, I don't see the choice of algorithm as very important. Yes, I believe the 2.4 one was better - but not by so much that it greatly alters the effectiveness of u32.

> The end goal is to reduce the total lookups at the expense of more RAM.
> I dont see using 20M as a big deal for example.

Sigh. No it's not .. now.

> Imagine a few thousand hosts... and you want to distribute
> these across many hash tables (i recall the test i had was for 64K
> unique hosts).
> One of the techniques you could use is to have several layers of hash
> tables before eventually hitting the terminal; at each then you could
> spread the bits (of lets say the 3rd octet of src IP in 3 bit
> hash masks) in the same hash level at the diagonal level.
> The result is many hash tables created (a lot of memory consumed) but
> faster lookups.
> In such a case a selection on a hash table could be made with bit
> pattern 101 or 111 etc as the mask with a hash table with only 8
> buckets.
>
> I am not sure if that made sense.

If you use a tree of hash tables, where each hash table is indexed off one byte - as I think you are suggesting - then the two algorithms are very similar. Identical, in fact, unless the "byte" you are indexing off is split across a natural 8 bit boundary. If that is the case the two algorithms will put things in different buckets, but there will still be only one filter element per bucket. So the only case worth comparing is where you are hashing on more than 8 bits. If the code space is dense (by that I mean: if you have N bits, then the number of items you have to distinguish between, E say, is > 2^(N-1)), then you are better off using a tree like you said. Effectively you aren't hashing - you are indexing into an array (if that makes any sense). So it is only in the non-dense cases, where E << 2^N, that the hashing algorithm becomes important.
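A rough way to see the dense/sparse distinction in python. The two hash functions below are illustrative stand-ins, not the actual kernel code: `hash_fold` folds the high byte into the low byte (in the spirit of the 2.4-era u32 hash), while `hash_mask` just masks off the low byte (closer in spirit to 2.6). The port numbers are randomly generated, not real data:

```python
import random

def hash_fold(key, nbuckets=256):
    # Fold the high byte into the low byte before masking
    # (illustrative stand-in for the 2.4-style folding hash).
    return ((key >> 8) ^ key) & (nbuckets - 1)

def hash_mask(key, nbuckets=256):
    # Just mask off the low byte (illustrative stand-in for
    # the 2.6-style hash).
    return key & (nbuckets - 1)

def bucket_counts(keys, hfn, nbuckets=256):
    counts = [0] * nbuckets
    for k in keys:
        counts[hfn(k, nbuckets)] += 1
    return counts

# Sparse case: ~400 ports out of a 16 bit code space (E << 2^N).
# Here the choice of hash can matter - look at the worst bucket.
random.seed(1)
ports = random.sample(range(65536), 400)
for name, fn in [("fold", hash_fold), ("mask", hash_mask)]:
    print(name, "worst bucket:", max(bucket_counts(ports, fn)))

# Dense case: every value of one byte is used (E == 2^N). Any
# one-byte index puts exactly one item in each bucket, so you are
# really indexing into an array and the algorithm stops mattering.
print("dense worst bucket:", max(bucket_counts(range(256), hash_mask)))
```

In the dense case the worst bucket is always exactly 1 regardless of which function you pick, which is the "indexing into an array" point above.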
In my case, I was indexing off the tcp/udp port. E == 400 odd, and N == 16. This is a very sparse use of the code space, and so the hashing mattered. I was so concerned that I would end up with 50 or 60 items in one bucket that I actually went to the trouble of plotting it with a python program. I was amazed at how well the original 2.4 hash worked. I was then stunned that the resulting u32 filter didn't work - because of the bugs in sample. (This is how I ended up here - posting patches to netdev.) When I checked the results using the 2.6 hashing function, the results were different, and worse, but no 50 item bucket disasters. This is all navel gazing of course - based on one example. Yours may reveal a different story.

> The equation you provided for evaluating the different hashes does look
> clever but i have never seen it used anywhere. What i did was to compute
> the std deviation from the best value.

Pulled it out of my arse :(. However, it is a standard deviation of sorts. Plot the buckets on a graph. The bucket number is the X axis. The number of items in the bucket is the Y axis. Since every bucket should have the same number of items, the resulting graph for a perfect hash should be a straight line, parallel to the X axis. Perhaps not exact, because you can't have 1/2 an item in a bucket, so you may get a 1 item difference between some buckets. The figure I calculated is the standard deviation of the actual graph from this perfect one.
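For what it's worth, the figure described above can be sketched in a few lines of python. This assumes the "perfect" line is the mean occupancy (total items / buckets) and uses the population normalization; the original script may have normalized slightly differently:

```python
import math

def hash_spread_score(counts):
    """Standard deviation of the actual bucket occupancies from the
    'perfect' flat line where every bucket holds the same number of
    items. 0.0 means a perfectly even spread; bigger is lumpier."""
    perfect = sum(counts) / len(counts)
    return math.sqrt(sum((c - perfect) ** 2 for c in counts) / len(counts))

# An even spread scores 0; one overloaded bucket scores high.
print(hash_spread_score([2, 2, 2, 2]))  # 0.0
print(hash_spread_score([8, 0, 0, 0]))  # sqrt(12), about 3.46
```

With this figure you can compare two hash functions on the same data set directly: run each over the keys, count the buckets, and the lower score spreads better.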