https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82245
Bug ID: 82245
Summary: [x86] missed optimization: (int64_t) i32 << constant on 32-bit machines can combine shift + sign extension like on other arches
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

#include <stdint.h>
int64_t shift64(int32_t a) { return (int64_t)a << 5; }

#ifdef __SIZEOF_INT128__
__int128 shift128(int64_t a) { return (__int128)a << 5; }
#endif

// https://godbolt.org/g/HsjpvV
gcc 8.0.0 20170918 -O3, shift128 on x86-64:

        movq    %rdi, %r8
        sarq    $63, %rdi
        movq    %r8, %rax       # could have just done cqto after this
        movq    %rdi, %rdx
        shldq   $5, %r8, %rdx
        salq    $5, %rax
        ret

vs. clang 4.0 (clang -m32 uses gcc's strategy, but -m64 for __int128 is much better):

        ## I think this is optimal
        movq    %rdi, %rax
        shlq    $5, %rax
        sarq    $59, %rdi       # >>(64-5) to get the upper half of a<<5.
        movq    %rdi, %rdx
        retq

On 32-bit, gcc does somewhat better, using cdq instead of mov + SAR:

shift64:
        pushl   %ebx            # gcc7.x regression to push/pop ebx
        movl    8(%esp), %eax
        popl    %ebx
        cltd
        shldl   $5, %eax, %edx
        sall    $5, %eax
        ret

SHLD r,r,imm is slow-ish on AMD (6 uops, 3c latency), but gcc still uses it even with -march=znver1. That tuning decision is a separate issue: the optimal sequence for Intel doesn't involve SHLD either in this specific case.

------

This may be an x86-specific missed optimization, since gcc does combine the shift and sign extension on other arches (though it still wastes a register-to-register move):

shift128:       # gcc6.3 on PowerPC64
        mr      4,3
        sldi    3,3,5
        sradi   4,4,59
        blr

I don't really know PPC64, but I think the mr 4,3 is wasted. SRADI is a regular 64-bit arithmetic shift with one input and one output
(http://ps-2.kev009.com/tl/techlib/manuals/adoclib/aixassem/alangref/sradi.htm). It could do:

        # hand-optimized for PPC64
        sradi   4,3,59
        sldi    3,3,5
        blr

AArch64 gcc6.3 has the same wasted mov as PowerPC64:

shift128:
        mov     x1, x0          # wasted
        lsl     x0, x0, 5
        asr     x1, x1, 59
        ret

shift64:        # ARM32 gcc6.3 has the same problem
        mov     r1, r0          # wasted
        lsl     r0, r0, #5
        asr     r1, r1, #27
        bx      lr

(Sorry for testing with old gcc on non-x86, but Godbolt only keeps x86 compilers really up to date. gcc5.4 doesn't combine the shift and sign-extension even on non-x86.)

gcc6.3 on x86 has the same output as current 8.0 / 7.2, except it avoids the weird and useless push/pop of %ebx in 32-bit mode.
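For completeness, here is a small standalone check (not part of the original report) of the two-shift decomposition clang uses for shift128: the low 64 bits of the widened shift are just a<<5, and the high 64 bits are the arithmetic shift a>>(64-5). It assumes gcc/clang semantics where >> on a signed type is an arithmetic shift, which is what the codegen above relies on:

#include <stdint.h>
#include <assert.h>

#ifdef __SIZEOF_INT128__
static void check(int64_t a)
{
    __int128 wide = (__int128)a << 5;
    uint64_t lo = (uint64_t)wide;          /* low 64 bits of the result */
    int64_t  hi = (int64_t)(wide >> 64);   /* high 64 bits of the result */
    assert(lo == ((uint64_t)a << 5));      /* low half: plain left shift */
    assert(hi == (a >> (64 - 5)));         /* high half: arithmetic right shift */
}

int main(void)
{
    int64_t tests[] = { 0, 1, -1, 0x123456789abcdef0, -0x123456789abcdef0,
                        INT64_MAX, INT64_MIN };
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++)
        check(tests[i]);
    return 0;
}
#else
int main(void) { return 0; }
#endif

The same identity with 32 and 27 in place of 64 and 59 covers the 32-bit shift64 case (and 64/59 again for shift64 on 64-bit ARM/PPC registers holding the widened value).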