https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049
            Bug ID: 89049
           Summary: [8/9 Regression] Unexpected vectorization
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jakub at gcc dot gnu.org
  Target Milestone: ---

int bar (float *p)
{
  float r = 0;
  for (int i = 0; i < 1024; ++i)
    r += p[i];
  return r;
}

with -O2 -mavx2 -ftree-vectorize starting with r256639 is vectorized as:

.L6:
        vmovups (%rdi), %xmm4
        vinsertf128     $0x1, 16(%rdi), %ymm4, %ymm2
        addq    $32, %rdi
        vaddss  %xmm4, %xmm0, %xmm0
        vshufps $85, %xmm4, %xmm4, %xmm3
        vshufps $255, %xmm4, %xmm4, %xmm1
        vaddss  %xmm3, %xmm0, %xmm0
        vunpckhps       %xmm4, %xmm4, %xmm3
        vaddss  %xmm3, %xmm0, %xmm0
        vaddss  %xmm1, %xmm0, %xmm0
        vextractf128    $0x1, %ymm2, %xmm1
        vshufps $85, %xmm1, %xmm1, %xmm2
        vaddss  %xmm1, %xmm0, %xmm0
        vaddss  %xmm2, %xmm0, %xmm0
        vunpckhps       %xmm1, %xmm1, %xmm2
        vshufps $255, %xmm1, %xmm1, %xmm1
        vaddss  %xmm2, %xmm0, %xmm0
        vaddss  %xmm1, %xmm0, %xmm0
        cmpq    %rdi, %rax
        jne     .L6

The only vector thing in the loop is the vector unaligned load, all the rest are
either extractions from the vector or scalar operations.  At least for -O2 I'd
hope we don't do this, I strongly believe scalar loop would be faster, and if we
don't decide to unroll it, even much smaller.  Either the costs are computed
wrongly here, or the vectorizer uses them wrongly.
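
For readers who want the shape of the generated loop in source form, here is a rough C model. It is only a sketch of what the listing appears to do, not compiler output: the function names sum_scalar and sum_block8_inorder are hypothetical, the inner copy loop stands in for the 32-byte load, and the second inner loop stands in for the chain of eight vaddss instructions, which add lanes 0 through 7 in order. Both functions return float rather than int so the sums can be compared directly.

```c
/* Scalar reference: the reduction from the testcase, one add per element,
   in source order.  */
static float sum_scalar(const float *p)
{
    float r = 0;
    for (int i = 0; i < 1024; ++i)
        r += p[i];
    return r;
}

/* Hypothetical model of the "vectorized" loop: fetch 8 floats at once,
   then fold them into r one lane at a time.  The additions are still
   serial and in the same order as sum_scalar, so nothing is gained on
   the arithmetic side; only the load is wide.  */
static float sum_block8_inorder(const float *p)
{
    float r = 0;
    for (int i = 0; i < 1024; i += 8) {
        float v[8];
        for (int k = 0; k < 8; ++k)   /* stands in for the 256-bit load */
            v[k] = p[i + k];
        for (int k = 0; k < 8; ++k)   /* stands in for the vaddss chain */
            r += v[k];
    }
    return r;
}
```

Since both functions perform the same additions in the same order, they should produce identical results; the block form only adds the lane extractions (the vshufps, vunpckhps and vextractf128 instructions in the listing) on top of the same dependent chain of scalar adds.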