https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113543
Bug ID: 113543
Summary: Poor codegen for bit-counting functions (countl_zero, countl_one, countr_zero, countr_one)
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: janschultke at googlemail dot com
Target Milestone: ---

## Code to Reproduce (https://godbolt.org/z/qPeszhaPv)

#include <bit>

template <typename T>
T countr_zero(T x) { return std::countr_zero(x); }
template unsigned char countr_zero(unsigned char);
template unsigned short countr_zero(unsigned short);
template unsigned int countr_zero(unsigned int);
template unsigned long countr_zero(unsigned long);
template unsigned long long countr_zero(unsigned long long);

template <typename T>
T countr_one(T x) { return std::countr_one(x); }
template unsigned char countr_one(unsigned char);
template unsigned short countr_one(unsigned short);
template unsigned int countr_one(unsigned int);
template unsigned long countr_one(unsigned long);
template unsigned long long countr_one(unsigned long long);

template <typename T>
T countl_zero(T x) { return std::countl_zero(x); }
template unsigned char countl_zero(unsigned char);
template unsigned short countl_zero(unsigned short);
template unsigned int countl_zero(unsigned int);
template unsigned long countl_zero(unsigned long);
template unsigned long long countl_zero(unsigned long long);

template <typename T>
T countl_one(T x) { return std::countl_zero(x); }
template unsigned char countl_one(unsigned char);
template unsigned short countl_one(unsigned short);
template unsigned int countl_one(unsigned int);
template unsigned long countl_one(unsigned long);
template unsigned long long countl_one(unsigned long long);

## Summary

GCC consistently emits much more code for these functions than clang. For example, GCC:

> unsigned int countl_one<unsigned int>(unsigned int):
>   xor eax, eax
>   lzcnt eax, edi
>   ret

Clang does not emit the extra xor instruction. I don't really know why.
LZCNT has a wide contract and should be equivalent to std::countl_zero.

It gets a lot worse though:

> unsigned short countl_zero<unsigned short>(unsigned short):
>   mov eax, 16
>   test di, di
>   je .L23
>   movzx edi, di
>   lzcnt edi, edi
>   lea eax, [rdi-16]
> .L23:
>   ret

I don't really know what all of this schmutz is. Clang emits just lzcnt and ret in this case.

Another bit of disappointing codegen is this:

> unsigned char countr_zero<unsigned char>(unsigned char):
>   movzx eax, dil
>   xor edx, edx
>   tzcnt edx, eax
>   test dil, dil
>   mov eax, 8
>   cmovne eax, edx
>   ret

Clang emits:

>   or edi, 256
>   tzcnt eax, edi
>   ret

This clang codegen is very clever. It simply sets bit 8, one position above the 8-bit value, so that the 32-bit routine can be re-used with only one additional instruction and no special case for zero.