https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764
--- Comment #2 from Roger Sayle <roger at nextmovesoftware dot com> ---
Investigating further, the thinking behind GCC's current behaviour can be found in Agner Fog's instruction tables: on many architectures BSR is much slower than LZCNT.

Legacy AMD:      BSR=4 cycles,  LZCNT=2 cycles
AMD BOBCAT:      BSR=6 cycles,  LZCNT=5 cycles
AMD JAGUAR:      BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN[1-3]:    BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN4:        BSR=1 cycle,   LZCNT=1 cycle
INTEL:           BSR=3 cycles,  LZCNT=3 cycles
KNIGHTS LANDING: BSR=11 cycles, LZCNT=3 cycles

Hence using BSR is only "better" in some (but not all) contexts, and a reasonable default (for generic tuning) is to ignore BSR when LZCNT is available, as it costs only one extra cycle of latency to perform the XOR. The correct solution is to add a tuning parameter to the x86 backend that controls whether it is beneficial to use BSR when LZCNT is available, for example when optimizing for size with -Os or -Oz. Doing so is more reasonable now that current Intel and AMD architectures have the same latency for BSR and LZCNT than it was when LZCNT first appeared (which explains the !TARGET_LZCNT condition in i386.md).
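For reference, here is a minimal C sketch (not GCC's actual implementation, and highest_bit_index is a hypothetical helper name) of the identity the extra XOR relies on: for a nonzero 32-bit value, the highest-set-bit index that BSR would return equals the LZCNT result XORed with 31.

/* Minimal illustration (not GCC's code): for nonzero 32-bit x,
   BSR(x) == LZCNT(x) ^ 31, because LZCNT(x) == 31 - BSR(x) and
   (31 - n) == (n ^ 31) for 0 <= n <= 31.  With -mlzcnt, GCC expands
   __builtin_clz to LZCNT, and the trailing XOR is the one extra
   operation discussed above. */
#include <assert.h>
#include <stdint.h>

static unsigned highest_bit_index(uint32_t x)   /* hypothetical helper */
{
  /* __builtin_clz is undefined for x == 0, matching BSR's undefined result. */
  return __builtin_clz(x) ^ 31;
}

int main(void)
{
  assert(highest_bit_index(1u) == 0);
  assert(highest_bit_index(0x80000000u) == 31);
  assert(highest_bit_index(0x00012345u) == 16);
  return 0;
}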