https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

            Bug ID: 89049
           Summary: [8/9 Regression] Unexpected vectorization
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jakub at gcc dot gnu.org
  Target Milestone: ---

int
bar (float *p)
{
  float r = 0;
  for (int i = 0; i < 1024; ++i)
    r += p[i];
  return r;
}
with -O2 -mavx2 -ftree-vectorize starting with r256639 is vectorized as:
.L6:
        vmovups (%rdi), %xmm4
        vinsertf128     $0x1, 16(%rdi), %ymm4, %ymm2
        addq    $32, %rdi
        vaddss  %xmm4, %xmm0, %xmm0
        vshufps $85, %xmm4, %xmm4, %xmm3
        vshufps $255, %xmm4, %xmm4, %xmm1
        vaddss  %xmm3, %xmm0, %xmm0
        vunpckhps       %xmm4, %xmm4, %xmm3
        vaddss  %xmm3, %xmm0, %xmm0
        vaddss  %xmm1, %xmm0, %xmm0
        vextractf128    $0x1, %ymm2, %xmm1
        vshufps $85, %xmm1, %xmm1, %xmm2
        vaddss  %xmm1, %xmm0, %xmm0
        vaddss  %xmm2, %xmm0, %xmm0
        vunpckhps       %xmm1, %xmm1, %xmm2
        vshufps $255, %xmm1, %xmm1, %xmm1
        vaddss  %xmm2, %xmm0, %xmm0
        vaddss  %xmm1, %xmm0, %xmm0
        cmpq    %rdi, %rax
        jne     .L6
The only vector operation in the loop is the unaligned vector load; everything
else is either an extraction from the vector or a scalar operation.  At least
at -O2 I'd hope we don't do this: I strongly believe a scalar loop would be
faster and, if we decide not to unroll it, also much smaller.  Either the costs
are computed wrongly here, or the vectorizer uses them wrongly.
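
For illustration only, here is a rough C model of what the loop above appears
to compute, read off the assembly: eight floats are fetched at once, then each
lane is added to the accumulator one at a time, in the original order.  My
assumption is that this is an in-order (fold-left) reduction, chosen because
float additions may not be reassociated without -ffast-math.  The function
names sum_scalar and sum_inorder_chunks are made up for this sketch and are
not part of the testcase or of GCC.

```c
#include <stddef.h>

/* Plain scalar reduction: one load and one add per element.  */
float
sum_scalar (const float *p, size_t n)
{
  float r = 0.0f;
  for (size_t i = 0; i < n; ++i)
    r += p[i];
  return r;
}

/* Hypothetical model of the emitted loop: fetch 8 floats as a block,
   then add the lanes one by one, in order.  */
float
sum_inorder_chunks (const float *p, size_t n)
{
  float r = 0.0f;
  size_t i = 0;
  for (; i + 8 <= n; i += 8)
    {
      float lanes[8];
      /* Stands in for the 32-byte unaligned load.  */
      for (int k = 0; k < 8; ++k)
        lanes[k] = p[i + k];
      /* Stands in for the lane extractions followed by vaddss.  */
      for (int k = 0; k < 8; ++k)
        r += lanes[k];
    }
  /* Remaining elements when n is not a multiple of 8.  */
  for (; i < n; ++i)
    r += p[i];
  return r;
}
```

Both functions perform the same additions in the same order, so the chunked
form still needs one scalar add per element and only adds the extraction work
on top, which is why I would expect no win over the scalar loop.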
