Hi Paul,

> substituting something
> more straightforward than a de Bruijn hash (possibly faster?).
> ...
> +#if !defined _GL_STDBIT_HAS_BUILTIN_CLZ && !_MSC_VER
> +/* __gl_stdbit_clztab[B] is the number of leading zeros in
> +   the 8-bit byte with value B.  */
> +char const __gl_stdbit_clztab[256] =
> +  {
> +    8,
> +    7,
> +    6, 6,
> +    5, 5, 5, 5,
> +    4, 4, 4, 4, 4, 4, 4, 4,
> +    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
> +
> +    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
> +    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
> +
> +    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> +    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> +    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> +    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> +
> +    /* The rest is zero.  */
> +  };
> +#endif
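For context, a byte table like this is typically consumed by a driver that narrows a wider word down to one lookup. The driver below is my own illustration (clz32, init_clztab, and clztab are names I chose; it is not the code from the patch), but it shows the shape of the technique:

```c
#include <stdint.h>

/* Rebuild the patch's table: clztab[b] = number of leading zeros
   in the 8-bit value b.  */
static char clztab[256];

static void init_clztab(void) {
  for (int b = 0; b < 256; b++) {
    int n = 8;
    for (int v = b; v != 0; v >>= 1)
      n--;
    clztab[b] = n;
  }
}

/* Hypothetical driver: shift the highest set byte into the top
   position, then finish with a single table lookup.  */
int clz32(uint32_t x) {
  int n = 0;
  if (!(x & 0xFFFF0000u)) { n += 16; x <<= 16; }
  if (!(x & 0xFF000000u)) { n += 8;  x <<= 8; }
  return n + clztab[x >> 24];
}
```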
I can't agree with the claim that a lookup in a 256-byte table is preferable to a de Bruijn hash because it is "faster". The reason is that such a 256-byte table will tend to occupy 256 bytes of the CPU's L1 cache, and thus reduce the ability of other code to use the L1 cache. Knowing that an L1 cache miss is about 200x slower than an L1 cache hit, this is a big penalty imposed on other code. The code with the de Bruijn hash uses 32 bytes of the L1 cache, 8 times less.

It is normal that microbenchmarks ignore such L1 cache considerations, and thus code which uses fewer instructions but more memory comes out faster there. But that does not mean that it is preferable in real programs.

Another data point: it is well known that excessive inlining by a compiler causes a slowdown. I don't know whether this is due to worse register allocation (i.e. GCC spilling more variables to the stack instead of keeping them in registers) or to increased code size.

Do you have data on this?

Having said that, I have no objection to the patch, because most platforms use GCC or clang-derived compilers nowadays.

Bruno
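P.S. For concreteness, the 32-byte de Bruijn variant I mean looks roughly like the sketch below. This follows the well-known bit-twiddling construction (smear the top bit down, multiply by a de Bruijn constant, index a 32-entry table); it is not the exact code from gnulib, and the result is undefined for x == 0:

```c
#include <stdint.h>

/* 32-entry de Bruijn table giving floor(log2(x)) after the top bit
   of x has been smeared down; 0x07C4ACDD is a de Bruijn constant.  */
static const char debruijn_log2[32] = {
  0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
  8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};

/* Leading zeros of a nonzero 32-bit value: five shift-or steps,
   one multiply, one lookup in a 32-byte table.  */
int clz32_debruijn(uint32_t x) {
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  return 31 - debruijn_log2[(uint32_t)(x * 0x07C4ACDDu) >> 27];
}
```

The whole working set here is the 32-byte table plus a handful of registers, which is the L1 footprint difference I am pointing at.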