> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97445
>
> --- Comment #33 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> (In reply to Jan Hubicka from comment #32)
> > get_order is a wrapper around fls64. This can be implemented w/o asm
> > statement as follows:
> > int
> > my_fls64 (__u64 x)
> > {
> >   if (!x)
> >     return 0;
> >   return 64 - __builtin_clzl (x);
> > }
> >
> > This results in longer assembly than the kernel asm implementation. If
> > that matters I would replace the builtin_constant_p part of get_order by
> > this implementation, which is more transparent to the code size
> > estimation, and things will get inlined.
>
> Better __builtin_clzll so that it works also on 32-bit arches.
> Anyway, if the kernel's fls64 results in better code than my_fls64, we
> should look at GCC's code generation for that case.
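
For completeness, a minimal sketch of the variant Jakub suggests, using
__builtin_clzll so the operand width does not depend on the width of long
(the __u64 typedef is added here only to keep the snippet standalone; the
kernel defines it in its own headers):

typedef unsigned long long __u64;   /* stand-in for the kernel typedef */

int
my_fls64 (__u64 x)
{
  if (!x)
    return 0;                        /* clz of 0 is undefined, so guard it */
  return 64 - __builtin_clzll (x);   /* 1-based index of the highest set bit */
}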

Original asm is:

__attribute__ ((noinline))
int fls64(__u64 x)
{
  int bitpos = -1;
  asm("bsrq %1,%q0"
      : "+r" (bitpos)
      : "rm" (x));
  return bitpos + 1;
}

There seems to be a bug in the bsr{q} pattern. I can make GCC produce the
same code with:

__attribute__ ((noinline))
int my_fls64 (__u64 x)
{
  asm volatile ("movl $-1, %eax");
  return (__builtin_clzll (x) ^ 63) + 1;
}

But obviously the volatile asm should not be needed. I think bsrq is
incorrectly modelled as returning the full register:

(define_insn "bsr_rex64"
  [(set (match_operand:DI 0 "register_operand" "=r")
        (minus:DI (const_int 63)
                  (clz:DI (match_operand:DI 1 "nonimmediate_operand" "rm"))))
   (clobber (reg:CC FLAGS_REG))]
  "TARGET_64BIT"
  "bsr{q}\t{%1, %0|%0, %1}"
  [(set_attr "type" "alu1")
   (set_attr "prefix_0f" "1")
   (set_attr "znver1_decode" "vector")
   (set_attr "mode" "DI")])
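
Not part of the report, but a small standalone sketch (x86-64 only) that
compares the kernel-style asm version with the clz-based expression on a few
nonzero inputs. Zero is excluded because __builtin_clzll (0) is undefined,
while the asm version instead relies on bsr leaving the destination register
(preloaded with -1) unchanged for a zero input, which is the behaviour the
kernel depends on:

#include <stdio.h>

typedef unsigned long long __u64;   /* stand-in for the kernel typedef */

__attribute__ ((noinline)) int
fls64_asm (__u64 x)
{
  int bitpos = -1;
  asm ("bsrq %1,%q0" : "+r" (bitpos) : "rm" (x));
  return bitpos + 1;
}

__attribute__ ((noinline)) int
fls64_clz (__u64 x)
{
  /* For nonzero x, __builtin_clzll (x) is in [0, 63], so xor with 63 equals
     63 - clz, and the whole expression equals 64 - clz, i.e. fls64.  */
  return (__builtin_clzll (x) ^ 63) + 1;
}

int
main (void)
{
  __u64 tests[] = { 1, 2, 0x80, 0xffff, 1ULL << 32, 1ULL << 63 };
  for (unsigned int i = 0; i < sizeof tests / sizeof tests[0]; i++)
    printf ("%#llx -> asm %d, clz %d\n",
            tests[i], fls64_asm (tests[i]), fls64_clz (tests[i]));
  return 0;
}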