https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97738
--- Comment #3 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Even faster code:
ctz = __builtin_ctz (value);
lowest_bit = value & - value;
left_bits = value + lowest_bit;
changed_bits = value ^ left_bits;
right_bits = changed_bits >> (ctz + 2);
return left_bits | right_bits;
The first two statements get compiled directly (with -march=native on a
machine that supports BMI1) to
blsi %edi, %edx
tzcntl %edi, %eax
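
For reference, here is the snippet wrapped into a complete function, as a
minimal sketch. The function name next_same_popcount and the choice of
unsigned int are assumptions for illustration, not taken from the PR. The
two asserts are also additions: __builtin_ctz (0) is undefined, and the
shift count ctz + 2 must stay below the width of the type, so the sketch
assumes a 32-bit unsigned int and an input whose lowest set bit is below
bit 30.

```c
#include <assert.h>

/* Hypothetical wrapper around the snippet above: returns the next larger
   value with the same number of set bits as VALUE.  */
static unsigned int
next_same_popcount (unsigned int value)
{
  /* __builtin_ctz is undefined for 0.  */
  assert (value != 0);

  unsigned int ctz = (unsigned int) __builtin_ctz (value);

  /* Assumption: 32-bit unsigned int.  Shifting by 32 or more would be
     undefined behavior, so reject inputs whose lowest set bit is too high.  */
  assert (ctz + 2 < 32);

  /* Isolate the lowest set bit.  */
  unsigned int lowest_bit = value & -value;

  /* Adding it ripples the carry through the lowest run of ones.  */
  unsigned int left_bits = value + lowest_bit;

  /* Bits that differ between the input and the rippled value.  */
  unsigned int changed_bits = value ^ left_bits;

  /* Move the remaining ones of the run down to the bottom.  */
  unsigned int right_bits = changed_bits >> (ctz + 2);

  return left_bits | right_bits;
}
```

For example, 0x13 (binary 10011) maps to 0x15 (binary 10101), and 6 (binary
110) maps to 9 (binary 1001).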
