https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764

--- Comment #2 from Roger Sayle <roger at nextmovesoftware dot com> ---
Investigating further, the rationale for GCC's current behaviour can be found
in Agner Fog's instruction tables; on many microarchitectures BSR has much
higher latency than LZCNT (latencies in cycles below).

Legacy AMD:      BSR=4 cycles,  LZCNT=2 cycles
AMD BOBCAT:      BSR=6 cycles,  LZCNT=5 cycles
AMD JAGUAR:      BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN[1-3]:    BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN4:        BSR=1 cycle,   LZCNT=1 cycle
INTEL:           BSR=3 cycles,  LZCNT=3 cycles
KNIGHTS LANDING: BSR=11 cycles, LZCNT=3 cycles

Hence using BSR is only "better" in some (but not all) contexts, and a
reasonable default (for generic tuning) is to ignore BSR when LZCNT is
available, as converting LZCNT's result into BSR's bit index costs only one
extra cycle of latency for the XOR.
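
To make that trade-off concrete, here is a minimal sketch (mine, not from the
original report; highest_set_bit is an illustrative helper, not GCC code) of
the equivalence the XOR exploits: for non-zero 32-bit x the leading-zero count
lies in [0,31], so 31 - clz(x) == 31 ^ clz(x), and BSR's result can be
recovered from LZCNT with a single XOR against 31.

  /* Sketch only.  Compilers may lower this either to a single "bsr", or to
     "lzcnt" followed by "xor $31" when LZCNT is preferred.  */
  #include <assert.h>

  static unsigned highest_set_bit (unsigned x)
  {
    /* Index of the most significant set bit of a non-zero x.  */
    return 31 ^ __builtin_clz (x);
  }

  int main (void)
  {
    assert (highest_set_bit (1) == 0);
    assert (highest_set_bit (0x80000000u) == 31);
    assert (highest_set_bit (0x1234u) == 12);
    return 0;
  }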

The correct solution is to add a tuning parameter to the x86 backend that
controls whether it is beneficial to use BSR when LZCNT is available, for
example when optimizing for size with -Os or -Oz.  Favouring BSR is more
reasonable now that current Intel and AMD architectures have the same latency
for BSR and LZCNT than it was when LZCNT first appeared, which explains the
!TARGET_LZCNT condition in i386.md.
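
For illustration only, a hypothetical sketch of what such a knob could look
like, following the usual x86-tune.def / i386.h / i386.md pattern; the tune
name, CPU mask and insn condition below are my assumptions, not existing GCC
source:

  /* Hypothetical, not actual GCC code.  A new entry in
     gcc/config/i386/x86-tune.def recording which microarchitectures have a
     fast BSR (mask is illustrative):

       DEF_TUNE (X86_TUNE_USE_BSR, "use_bsr", m_ZEN4)

     exposed in i386.h in the usual way:

       #define TARGET_USE_BSR ix86_tune_features[X86_TUNE_USE_BSR]

     The clz/bsr patterns in i386.md could then test something like
       "!TARGET_LZCNT || TARGET_USE_BSR || optimize_function_for_size_p (cfun)"
     instead of just "!TARGET_LZCNT".  */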
