https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94650
            Bug ID: 94650
           Summary: Missed x86-64 peephole optimization: x >= large power of two
           Product: gcc
           Version: 9.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pascal_cuoq at hotmail dot com
  Target Milestone: ---

Consider the three functions check, test0 and test1:

(Compiler Explorer link: https://gcc.godbolt.org/z/Sh4GpR )

#include <string.h>

#define LARGE_POWER_OF_TWO (1UL << 40)

int check(unsigned long m)
{
    return m >= LARGE_POWER_OF_TWO;
}

void g(int);

void test0(unsigned long m)
{
    if (m >= LARGE_POWER_OF_TWO) g(0);
}

void test1(unsigned long m)
{
    if (m >= LARGE_POWER_OF_TWO) g(m);
}

At least in the case of check and test0, the optimal way to compare m to
1<<40 is to shift m right by 40 and compare the result to 0. This is the code
generated for these functions by Clang 10:

check:                                  # @check
        xorl    %eax, %eax
        shrq    $40, %rdi
        setne   %al
        retq
test0:                                  # @test0
        shrq    $40, %rdi
        je      .LBB1_1
        xorl    %edi, %edi
        jmp     g                       # TAILCALL
.LBB1_1:
        retq

In contrast, GCC 9.3 uses a 64-bit constant that needs to be loaded into a
register with movabsq:

check:
        movabsq $1099511627775, %rax
        cmpq    %rax, %rdi
        seta    %al
        movzbl  %al, %eax
        ret
test0:
        movabsq $1099511627775, %rax
        cmpq    %rax, %rdi
        ja      .L5
        ret
.L5:
        xorl    %edi, %edi
        jmp     g

In the case of the function test1, the comparison is between these two
versions, because the shift is destructive and m is still needed for the
call to g:

Clang 10:

test1:                                  # @test1
        movq    %rdi, %rax
        shrq    $40, %rax
        je      .LBB2_1
        jmp     g                       # TAILCALL
.LBB2_1:
        retq

GCC 9.3:

test1:
        movabsq $1099511627775, %rax
        cmpq    %rax, %rdi
        ja      .L8
        ret
.L8:
        jmp     g

It is less obvious which approach is better in the case of the function
test1, but generally speaking the shift approach should still be faster. The
register-register move can be free on Skylake (in the sense of not needing
any execution port), whereas movabsq requires an execution port and is also
a 10-byte instruction.
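For reference, the equivalence that the shift-based code relies on can be written out in C. This is only a sketch and not part of the reproducer: the names check_cmp, check_shift and forms_agree_on_samples are made up for illustration, and uint64_t is used instead of unsigned long so that the sketch does not depend on the target's long being 64 bits wide. For an unsigned 64-bit m, m >= 2^40 holds exactly when at least one bit at position 40 or above is set, that is, when m >> 40 is nonzero.

```c
#include <stdint.h>

/* Threshold from the report, written with a fixed-width type (an assumption
   made here so the sketch does not depend on the width of unsigned long). */
#define LARGE_POWER_OF_TWO (UINT64_C(1) << 40)

/* Comparison form, mirroring the report's check(). */
static int check_cmp(uint64_t m)
{
    return m >= LARGE_POWER_OF_TWO;
}

/* Shift form: m >= 2^40 exactly when a bit at position 40 or above is set. */
static int check_shift(uint64_t m)
{
    return (m >> 40) != 0;
}

/* Hypothetical helper: returns 1 if both forms agree on a few values
   around the threshold, 0 otherwise. */
static int forms_agree_on_samples(void)
{
    const uint64_t samples[] = {
        0, 1,
        LARGE_POWER_OF_TWO - 1, LARGE_POWER_OF_TWO, LARGE_POWER_OF_TWO + 1,
        UINT64_MAX
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        if (check_cmp(samples[i]) != check_shift(samples[i]))
            return 0;
    return 1;
}
```

The sample check is a sanity test of the boundary values, not a proof; the argument in the paragraph above is what justifies the rewrite for every power-of-two threshold.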