On Thu, Jul 20, 2017 at 01:24:49PM +0200, Peter Zijlstra wrote:
> ~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1 
>  ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt
> 
>  Performance counter stats for './sqrt' (10 runs):
> 
>        328,415,775      cycles:u                                              
>         ( +-  0.15% )
>      1,138,579,704      instructions:u            #    3.47  insn per cycle   
>         ( +-  0.00% )
> 
>        0.088703205 seconds time elapsed  

> static __always_inline unsigned long fls(unsigned long word)
> {
>       asm("rep; bsr %1,%0"
>               : "=r" (word)
>               : "rm" (word));
>       return BITS_PER_LONG - 1 - word;
> }

That is actually "lzcnt", if I used the regular fls implementation:


static __always_inline unsigned long __fls(unsigned long word)
{
        asm("bsr %1,%0"
                : "=r" (word)
                : "rm" (word));
        return word;
}

It ends up slightly more expensive:

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1  
; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

 Performance counter stats for './sqrt' (10 runs):

       384,842,215      cycles:u                                                
      ( +-  0.08% )
     1,118,579,712      instructions:u            #    2.91  insn per cycle     
      ( +-  0.00% )

       0.103018001 seconds time elapsed  


Still loads cheaper than pretty much any other combination.


Reply via email to