[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

peter at cordes dot ca Sun, 14 Jan 2018 19:23:25 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846


--- Comment #25 from Peter Cordes <peter at cordes dot ca> ---
We're getting a spill/reload inside the loop with AVX512:

    .L2:
        vmovdqa64       (%esp), %zmm3
        vpaddd  (%eax), %zmm3, %zmm2
        addl    $64, %eax
        vmovdqa64       %zmm2, (%esp)
        cmpl    %eax, %edx
        jne     .L2

The loop finishes with the accumulator both in memory *and* in ZMM2.  The copy
in ZMM2 is ignored, and we get:

    # narrow to 32 bytes using memory indexing instead of VEXTRACTI32X8 or
    # VEXTRACTI64X4
        vmovdqa 32(%esp), %ymm5
        vpaddd  (%esp), %ymm5, %ymm0

    # braindead: vextracti128 can write a new reg instead of destroying xmm0
        vmovdqa %xmm0, %xmm1
        vextracti128    $1, %ymm0, %xmm0
        vpaddd  %xmm0, %xmm1, %xmm0

    ... then a sane 128b hsum as expected, so at least that part went right.
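
For readers following along, here is a rough scalar C model of the reduction
order discussed above: keep the accumulator live across the loop, then narrow
512b -> 256b -> 128b by adding the high half onto the low half before the final
128b hsum.  This is only a sketch, not the test case from this bug; the names
sum_scalar, sum_narrowing and LANES are made up for illustration, and the
array acc[] merely stands in for a ZMM register.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16  /* number of 32-bit elements in one 512-bit vector */

/* Scalar reference: the kind of loop an auto-vectorizer turns into vpaddd.
 * Unsigned arithmetic is used so wraparound is well defined. */
int32_t sum_scalar(const int32_t *arr, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (uint32_t)arr[i];
    return (int32_t)sum;
}

/* Model of the vectorized sum.  acc[] plays the role of the vector
 * accumulator; n is assumed to be a multiple of LANES. */
int32_t sum_narrowing(const int32_t *arr, size_t n)
{
    uint32_t acc[LANES] = {0};

    /* Main loop: one 16-lane add per iteration, accumulator stays "live". */
    for (size_t i = 0; i < n; i += LANES)
        for (int l = 0; l < LANES; l++)
            acc[l] += (uint32_t)arr[i + l];

    /* Narrow right away: 16 -> 8 -> 4 -> 2 -> 1 lanes, each step adding
     * the high half of the current width onto the low half. */
    for (int width = LANES / 2; width >= 1; width /= 2)
        for (int l = 0; l < width; l++)
            acc[l] += acc[l + width];

    return (int32_t)acc[0];
}
```

Both functions should return the same value for any input whose length is a
multiple of LANES, since 32-bit wrapping addition can be reordered freely.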
