https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123330
--- Comment #6 from Tobias Schlüter <tobi at gcc dot gnu.org> ---
Andrew, I cannot confirm your assertion that gcc can always optimize the
ternary to use a conditional move: https://godbolt.org/z/hW6r9vnT7

This is a more complete extract of the code in question:

=================
#include <stdint.h>

auto clz(uint64_t x) noexcept -> int {
#if __has_builtin(__builtin_clzll)
  return __builtin_clzll(x);
#elif defined(_MSC_VER) && defined(__AVX2__) && defined(_M_AMD64)
  // Use lzcnt only on AVX2-capable CPUs that have this BMI instruction.
  return __lzcnt64(x);
#elif defined(_MSC_VER) && (defined(_M_AMD64) || defined(_M_ARM64))
  unsigned long idx;
  _BitScanReverse64(&idx, x);  // Fallback to the BSR instruction.
  return 63 - idx;
#elif defined(_MSC_VER)
  // Fallback to the 32-bit BSR instruction.
  unsigned long idx;
  if (_BitScanReverse(&idx, uint32_t(x >> 32)))
    return 31 - idx;
  _BitScanReverse(&idx, uint32_t(x));
  return 63 - idx;
#else
  // Portable fallback: count leading zeros one bit at a time.
  int n = 64;
  for (; x > 0; x >>= 1) --n;
  return n;
#endif
}

auto count_zeros(uint16_t u) -> uint64_t {
  return u == 0 ? 0 : 64 - clz(u);
}

auto count_zeros_branchless(uint16_t u) -> int {
  // This is always optimized to a cmov
  return (u != 0) * clz(u);
}
===================

This gives, without -mlzcnt:

clz(unsigned long):
        bsr     rax, rdi
        xor     eax, 63
        ret
count_zeros(unsigned short):
        xor     eax, eax
        test    di, di
        je      .L3
        movzx   edi, di
        bsr     rdi, rdi
        lea     eax, [rdi+1]
.L3:
        ret

and with -mlzcnt:

clz(unsigned long):
        xor     eax, eax
        lzcnt   rax, rdi
        ret
count_zeros(unsigned short):
        movzx   ecx, di
        mov     edx, 64
        xor     eax, eax
        lzcnt   rcx, rcx
        sub     edx, ecx
        test    di, di
        cmovne  eax, edx
        ret

This shows that gcc always avoids executing the instruction with an input for
which the processor's result is undefined (BSR with a zero operand), even when
it knows that the result would be rejected anyway. So the choice of using a
jump is not driven by performance considerations.

(By accident I also discovered that it never uses the cmov if the return type
of count_zeros is uint64_t instead of int. I don't think there's a reason for
that.)
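For reference, here is a sketch (not from the report above; the name
bit_width_no_jump is made up, and I have not checked the generated code) that
combines the count_zeros_branchless pattern with a forced non-zero operand, so
that neither __builtin_clzll nor BSR ever sees a zero input. It computes the
same value as count_zeros, and if the multiply is turned into a cmov as claimed
above, it should avoid the jump even without -mlzcnt:

=================
// Assumes the clz() helper from the extract above.
auto bit_width_no_jump(uint16_t u) -> uint64_t {
  // For u != 0, (u | 1) has the same highest set bit as u, so the clz result
  // is unchanged; for u == 0, clz receives the well-defined input 1 and the
  // multiplication by (u != 0) zeroes the final result.
  return (u != 0) * (64 - clz(uint64_t(u) | 1));
}
=================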
