https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764

--- Comment #2 from Roger Sayle <roger at nextmovesoftware dot com> ---
Investigating further, the rationale for GCC's current behaviour can be found
in Agner Fog's instruction tables; on many microarchitectures BSR has much
higher latency than LZCNT (latencies in cycles below).

Legacy AMD:      BSR=4 cycles,  LZCNT=2 cycles
AMD BOBCAT:      BSR=6 cycles,  LZCNT=5 cycles
AMD JAGUAR:      BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN[1-3]:    BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN4:        BSR=1 cycle,   LZCNT=1 cycle
INTEL:           BSR=3 cycles,  LZCNT=3 cycles
KNIGHTS LANDING: BSR=11 cycles, LZCNT=3 cycles

Hence using BSR is only "better" in some (but not all) contexts, and a
reasonable default (for generic tuning) is to ignore BSR when LZCNT is
available, as converting LZCNT's result into BSR's bit index costs only one
extra cycle of latency for the XOR.
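
To make that trade-off concrete, here is a minimal sketch (mine, not from the
original report; highest_set_bit is an illustrative helper, not GCC code) of
the equivalence the XOR exploits: for non-zero 32-bit x the leading-zero count
lies in [0,31], so 31 - clz(x) == 31 ^ clz(x), and BSR's result can be
recovered from LZCNT with a single XOR against 31.

  /* Sketch only.  Compilers may lower this either to a single "bsr", or to
     "lzcnt" followed by "xor $31" when LZCNT is preferred.  */
  #include <assert.h>

  static unsigned highest_set_bit (unsigned x)
  {
    /* Index of the most significant set bit of a non-zero x.  */
    return 31 ^ __builtin_clz (x);
  }

  int main (void)
  {
    assert (highest_set_bit (1) == 0);
    assert (highest_set_bit (0x80000000u) == 31);
    assert (highest_set_bit (0x1234u) == 12);
    return 0;
  }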

The correct solution is to add a tuning parameter to the x86 backend that
controls whether it is beneficial to use BSR when LZCNT is available, for
example when optimizing for size with -Os or -Oz.  Favouring BSR is more
reasonable now that current Intel and AMD architectures have the same latency
for BSR and LZCNT than it was when LZCNT first appeared, which explains the
!TARGET_LZCNT condition in i386.md.
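
For illustration only, a hypothetical sketch of what such a knob could look
like, following the usual x86-tune.def / i386.h / i386.md pattern; the tune
name, CPU mask and insn condition below are my assumptions, not existing GCC
source:

  /* Hypothetical, not actual GCC code.  A new entry in
     gcc/config/i386/x86-tune.def recording which microarchitectures have a
     fast BSR (mask is illustrative):

       DEF_TUNE (X86_TUNE_USE_BSR, "use_bsr", m_ZEN4)

     exposed in i386.h in the usual way:

       #define TARGET_USE_BSR ix86_tune_features[X86_TUNE_USE_BSR]

     The clz/bsr patterns in i386.md could then test something like
       "!TARGET_LZCNT || TARGET_USE_BSR || optimize_function_for_size_p (cfun)"
     instead of just "!TARGET_LZCNT".  */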
