https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049
Bug ID: 89049
Summary: [8/9 Regression] Unexpected vectorization
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: jakub at gcc dot gnu.org
Target Milestone: ---
int
bar (float *p)
{
  float r = 0;
  for (int i = 0; i < 1024; ++i)
    r += p[i];
  return r;
}
With -O2 -mavx2 -ftree-vectorize, starting with r256639, this is vectorized as:
.L6:
vmovups (%rdi), %xmm4
vinsertf128 $0x1, 16(%rdi), %ymm4, %ymm2
addq $32, %rdi
vaddss %xmm4, %xmm0, %xmm0
vshufps $85, %xmm4, %xmm4, %xmm3
vshufps $255, %xmm4, %xmm4, %xmm1
vaddss %xmm3, %xmm0, %xmm0
vunpckhps %xmm4, %xmm4, %xmm3
vaddss %xmm3, %xmm0, %xmm0
vaddss %xmm1, %xmm0, %xmm0
vextractf128 $0x1, %ymm2, %xmm1
vshufps $85, %xmm1, %xmm1, %xmm2
vaddss %xmm1, %xmm0, %xmm0
vaddss %xmm2, %xmm0, %xmm0
vunpckhps %xmm1, %xmm1, %xmm2
vshufps $255, %xmm1, %xmm1, %xmm1
vaddss %xmm2, %xmm0, %xmm0
vaddss %xmm1, %xmm0, %xmm0
cmpq %rdi, %rax
jne .L6
The only vector operation in the loop is the unaligned vector load; all the
rest are either extractions from the vector or scalar operations. At least for
-O2 I'd hope we don't do this. I strongly believe a scalar loop would be
faster and, if we decide not to unroll it, also much smaller. Either the costs
are computed wrongly here, or the vectorizer uses them wrongly.
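
For illustration only, here is a rough C model of what the emitted loop appears
to compute, next to the plain scalar loop. The names sum_scalar and
sum_chunked_in_order are hypothetical and are not taken from GCC or from the
testcase. The model assumes the element count is a multiple of 8, and it stands
in for the 32-byte load with a small temporary array:

```c
#include <stddef.h>

/* Plain scalar reduction: one load and one add per element. */
float sum_scalar(const float *p, size_t n)
{
    float r = 0.0f;
    for (size_t i = 0; i < n; ++i)
        r += p[i];
    return r;
}

/* Hypothetical model of the vectorized loop: fetch 8 floats at once,
   then add each lane to the accumulator one at a time, in the original
   order. Assumes n is a multiple of 8 (any remainder is ignored). */
float sum_chunked_in_order(const float *p, size_t n)
{
    float r = 0.0f;
    for (size_t i = 0; i + 8 <= n; i += 8) {
        float v[8];
        for (int k = 0; k < 8; ++k)
            v[k] = p[i + k];      /* stands in for the single vector load */
        for (int k = 0; k < 8; ++k)
            r += v[k];            /* one scalar add per extracted lane */
    }
    return r;
}
```

Both functions perform the additions in the same order, so they should return
identical results; the chunked form only changes how the elements are fetched,
which is why the extra shuffles and extractions look like pure overhead.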