https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123330

--- Comment #6 from Tobias Schlüter <tobi at gcc dot gnu.org> ---
Andrew, I cannot confirm your assertion that gcc can always optimize the
ternary to use a conditional move:
https://godbolt.org/z/hW6r9vnT7

This is a more complete extract of the code in question
=================
#include <stdint.h>

auto clz(uint64_t x) noexcept -> int {
#if __has_builtin(__builtin_clzll)
  return __builtin_clzll(x);
#elif defined(_MSC_VER) && defined(__AVX2__) && defined(_M_AMD64)
  // Use lzcnt only on AVX2-capable CPUs; those all support the LZCNT instruction.
  return __lzcnt64(x);
#elif defined(_MSC_VER) && (defined(_M_AMD64) || defined(_M_ARM64))
  unsigned long idx;
  _BitScanReverse64(&idx, x);  // Fallback to the BSR instruction.
  return 63 - idx;
#elif defined(_MSC_VER)
  // Fallback to the 32-bit BSR instruction.
  unsigned long idx;
  if (_BitScanReverse(&idx, uint32_t(x >> 32))) return 31 - idx;
  _BitScanReverse(&idx, uint32_t(x));
  return 63 - idx;
#else
  int n = 64;
  for (; x > 0; x >>= 1) --n;
  return n;
#endif
}

auto count_zeros(uint16_t u) -> uint64_t {
    return u == 0 ? 0 : 64 - clz(u);
}

auto count_zeros_branchless(uint16_t u) -> int {
    // This is always optimized to a cmov
    return (u != 0) * clz(u);
}
===================

This gives, without -mlzcnt:
clz(unsigned long):
        bsr     rax, rdi
        xor     eax, 63
        ret
count_zeros(unsigned short):
        xor     eax, eax
        test    di, di
        je      .L3
        movzx   edi, di
        bsr     rdi, rdi
        lea     eax, [rdi+1]
.L3:
        ret
and with -mlzcnt:
clz(unsigned long):
        xor     eax, eax
        lzcnt   rax, rdi
        ret
count_zeros(unsigned short):
        movzx   ecx, di
        mov     edx, 64
        xor     eax, eax
        lzcnt   rcx, rcx
        sub     edx, ecx
        test    di, di
        cmovne  eax, edx
        ret
which indicates that gcc always avoids executing the instruction whose result
is undefined (bsr leaves its destination undefined for a zero input), even
when it knows that result would be discarded.  This suggests that the choice
of a jump over a cmov is not driven by performance considerations.  (By
accident I also discovered that it never uses the cmov if the return type of
count_zeros is uint64_t instead of int.  I don't think there's a reason for
that.)
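
As an aside, here's a sketch of a possible workaround (not part of the code in
question; the function name is made up).  If the argument passed to clz is made
provably non-zero, the undefined bsr-on-zero case disappears, and I'd expect
gcc to need neither the jump nor the cmov:
=================
// Appended to the extract above; reuses its clz().
// (uint64_t(u) << 1) | 1 is never zero, so the builtin's precondition
// always holds.
// For u == 0:  64 - clz(1) - 1 == 64 - 63 - 1 == 0.
// For u  > 0:  the trailing |1 does not change the highest set bit, so
//              clz((u << 1) | 1) == clz(u) - 1 and the result is 64 - clz(u).
auto count_zeros_nonzero(uint16_t u) -> uint64_t {
    return 64 - clz((uint64_t(u) << 1) | 1) - 1;
}
=================
This matches count_zeros for all inputs, including zero, and should compile to
a single bsr with no branch, which again points to the jump above existing
only to sidestep the undefined case rather than for speed.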
