https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82298
Bug ID: 82298
Summary: x86 BMI: no peephole for BZHI
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

gcc never seems to emit BZHI on its own.

// exact BZHI behaviour for all inputs (with no C UB)
unsigned bzhi_exact(unsigned x, unsigned c) {
    c &= 0xff;
    if (c <= 31) {
        x &= ((1U << c) - 1);
        // 1ULL defeats clang's peephole, but is a convenient way to avoid UB for count=32.
    }
    return x;
}

// https://godbolt.org/g/tZKnV3

unsigned long bzhi_l(unsigned long x, unsigned c) {
    return x & ((1UL << c) - 1);
}

Out-of-range shift UB allows peepholing to BZHI for the simpler case, so these (respectively) should compile to

    bzhil   %esi, %edi, %eax
    bzhiq   %rsi, %rdi, %rax

But we actually get (gcc8 -O3 -march=haswell (-mbmi2))

    movq    $-1, %rax
    shlx    %rsi, %rax, %rdx
    andn    %rdi, %rdx, %rax
    ret

Or that with a test&branch for bzhi_exact.

Clang succeeds at peepholing BZHI here, but it still does the &0xff and the test&branch to skip BZHI when it would do nothing.

It's easy to imagine cases where the source would use a conditional to avoid UB when it wants to leave x unmodified for c==32, and the range is 1 to 32:

unsigned bzhi_1_to_32(unsigned x, unsigned c) {
    if (c != 32)
        x &= ((1U << c) - 1);
    return x;
}

BZHI is defined to saturate the index to OperandSize, so it copies src1 unmodified when the low 8 bits of src2 are >= 32 (32-bit operand size) or >= 64 (64-bit operand size). (See the Operation section of http://felixcloutier.com/x86/BZHI.html. The text description there is wrong: it claims the index saturates to OperandSize-1, which would zero the high bit.)
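For reference, the saturating behaviour described above can be written as a portable C model. This is only a sketch of my reading of the Operation section, not code taken from gcc or clang; the names bzhi32_model, bzhi64_model and bzhi_model_check are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference model of 32-bit BZHI (illustrative name).
   Only index[7:0] is examined; an index >= 32 saturates to the
   operand size, so src is returned unchanged. */
uint32_t bzhi32_model(uint32_t src, uint32_t index)
{
    uint32_t n = index & 0xff;
    if (n >= 32)
        return src;
    return src & (((uint32_t)1 << n) - 1);   /* n is 0..31: no shift UB */
}

/* Same model for 64-bit operand size. */
uint64_t bzhi64_model(uint64_t src, uint64_t index)
{
    uint64_t n = index & 0xff;
    if (n >= 64)
        return src;
    return src & (((uint64_t)1 << n) - 1);   /* n is 0..63: no shift UB */
}

/* A few spot checks of the saturating behaviour. */
void bzhi_model_check(void)
{
    assert(bzhi32_model(0xFFFFFFFFu, 0) == 0);              /* clears everything */
    assert(bzhi32_model(0xFFFFFFFFu, 31) == 0x7FFFFFFFu);   /* clears only bit 31 */
    assert(bzhi32_model(0xDEADBEEFu, 32) == 0xDEADBEEFu);   /* saturated: copy */
    assert(bzhi32_model(0xDEADBEEFu, 255) == 0xDEADBEEFu);  /* saturated: copy */
    assert(bzhi32_model(0xDEADBEEFu, 0x100) == 0);          /* low 8 bits are 0 */
    assert(bzhi64_model(UINT64_MAX, 64) == UINT64_MAX);     /* saturated: copy */
}
```

bzhi32_model has the same shape as bzhi_exact above (mask to 8 bits, then skip the AND when the count is out of range), which is the pattern a peephole would ideally recognise as a single BZHI.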
Other ways to express it (which clang fails to peephole to BZHI, like gcc):

unsigned bzhi2(unsigned x, unsigned c) {
    // c &= 0xff;
    // if(c < 32) {
    x &= (0xFFFFFFFFUL >> (32-c));
    // }
    return x;
}

unsigned bzhi3(unsigned long x, unsigned c) {
    // c &= 0xff;
    return x & ~(-1U << c);
}

Related: pr65871 suggested this, but was really about taking advantage of flags set by __builtin_ia32_bzhi_si, so it is correctly closed.

pr66872 suggested transforming x & ((1 << t) - 1); to x & ~(-1 << t); to enable ANDN. Compiling both to BZHI when BMI2 is available was mentioned, but wasn't the main subject of that bug either.
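To make it concrete that these mask idioms are interchangeable (and so are all candidates for the same BZHI pattern), here is a small portable check. It is a sketch with made-up helper names (mask_shl_sub, mask_not_shl, mask_shr, check_masks_agree), and it assumes a 64-bit intermediate for the right-shift form so that c == 0 does not shift a 32-bit value by 32.

```c
#include <assert.h>
#include <stdint.h>

/* (1U << c) - 1, well defined for c in 0..31. */
uint32_t mask_shl_sub(unsigned c)
{
    return ((uint32_t)1 << c) - 1;
}

/* ~(-1U << c), the ANDN-friendly form from pr66872; c in 0..31. */
uint32_t mask_not_shl(unsigned c)
{
    return (uint32_t)~((uint32_t)0xFFFFFFFFu << c);
}

/* 0xFFFFFFFF >> (32 - c), computed in 64 bits (an assumption of this
   sketch) so the shift count stays in range; valid for c in 0..32. */
uint32_t mask_shr(unsigned c)
{
    return (uint32_t)(UINT64_C(0xFFFFFFFF) >> (32 - c));
}

/* All three forms produce the same low-bit mask for c in 0..31. */
void check_masks_agree(void)
{
    for (unsigned c = 0; c < 32; c++) {
        assert(mask_shl_sub(c) == mask_not_shl(c));
        assert(mask_shl_sub(c) == mask_shr(c));
    }
}
```

Only the right-shift form extends to c == 32 without UB, where it yields an all-ones mask and so leaves x unmodified, matching what BZHI does for an index of 32.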