https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82245
Bug ID: 82245
Summary: [x86] missed optimization: (int64_t) i32 << constant on 32-bit machines can combine shift + sign extension like on other arches
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

#include <stdint.h>
int64_t shift64(int32_t a) { return (int64_t)a << 5; }

#ifdef __SIZEOF_INT128__
__int128 shift128(int64_t a) { return (__int128)a << 5; }
#endif

// https://godbolt.org/g/HsjpvV
gcc 8.0.0 20170918 -O3, shift128 on x86-64:

        movq    %rdi, %r8
        sarq    $63, %rdi
        movq    %r8, %rax       # could have just done cqto after this
        movq    %rdi, %rdx
        shldq   $5, %r8, %rdx
        salq    $5, %rax
        ret

vs. clang 4.0 (clang -m32 uses gcc's strategy, but -m64 for __int128 is much better):

        ## I think this is optimal
        movq    %rdi, %rax
        shlq    $5, %rax
        sarq    $59, %rdi       # >>(64-5) to get the upper half of a<<5.
        movq    %rdi, %rdx
        retq

On 32-bit, gcc does somewhat better, using cdq instead of mov + SAR:

shift64:
        pushl   %ebx            # gcc7.x regression to push/pop ebx
        movl    8(%esp), %eax
        popl    %ebx
        cltd
        shldl   $5, %eax, %edx
        sall    $5, %eax
        ret

SHLD r,r,imm is slow-ish on AMD (6 uops, 3c latency), but gcc still uses it even with -march=znver1. That tuning decision is a separate issue: the optimal sequence for Intel doesn't involve SHLD either in this specific case.

------

This may be an x86-specific missed optimization, since gcc does combine the shift and sign extension on other arches (though it still wastes a register-to-register move):

shift128:       # gcc6.3 on PowerPC64
        mr      4,3
        sldi    3,3,5
        sradi   4,4,59
        blr

I don't really know PPC64, but I think the mr 4,3 is wasted. SRADI is a regular 64-bit arithmetic shift with one input and one output
(http://ps-2.kev009.com/tl/techlib/manuals/adoclib/aixassem/alangref/sradi.htm). It could do:

        # hand-optimized for PPC64
        sradi   4,3,59
        sldi    3,3,5
        blr

AArch64 gcc6.3 has the same wasted mov as PowerPC64:

shift128:
        mov     x1, x0          # wasted
        lsl     x0, x0, 5
        asr     x1, x1, 59
        ret

shift64:        # ARM32 gcc6.3 has the same problem
        mov     r1, r0          # wasted
        lsl     r0, r0, #5
        asr     r1, r1, #27
        bx      lr

(Sorry for testing with old gcc on non-x86, but Godbolt only keeps x86 compilers really up to date. gcc5.4 doesn't combine the shift and sign-extension even on non-x86.)

gcc6.3 on x86 has the same output as current 8.0 / 7.2, except it avoids the weird and useless push/pop of %ebx in 32-bit mode.
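For completeness, here is a small standalone check (not part of the original report) of the two-shift decomposition clang uses for shift128: the low 64 bits of the widened shift are just a<<5, and the high 64 bits are the arithmetic shift a>>(64-5). It assumes gcc/clang semantics where >> on a signed type is an arithmetic shift, which is what the codegen above relies on:

#include <stdint.h>
#include <assert.h>

#ifdef __SIZEOF_INT128__
static void check(int64_t a)
{
    __int128 wide = (__int128)a << 5;
    uint64_t lo = (uint64_t)wide;          /* low 64 bits of the result */
    int64_t  hi = (int64_t)(wide >> 64);   /* high 64 bits of the result */
    assert(lo == ((uint64_t)a << 5));      /* low half: plain left shift */
    assert(hi == (a >> (64 - 5)));         /* high half: arithmetic right shift */
}

int main(void)
{
    int64_t tests[] = { 0, 1, -1, 0x123456789abcdef0, -0x123456789abcdef0,
                        INT64_MAX, INT64_MIN };
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++)
        check(tests[i]);
    return 0;
}
#else
int main(void) { return 0; }
#endif

The same identity with 32 and 27 in place of 64 and 59 covers the 32-bit shift64 case (and 64/59 again for shift64 on 64-bit ARM/PPC registers holding the widened value).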