https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90579
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
A workaround on your side would be to stick the prefer-vector-width=128
target attribute (the function-level form of -mprefer-vector-width=128) on the
function.
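For example (a sketch only: the wrapper function is made up for illustration,
reusing the final reduction loop quoted below; the x86 target attribute is
assumed to be available):

/* Restrict the vectorizer to 128-bit vectors for this function only,
   avoiding the 256-bit store that the later scalar loads fail to CSE
   against.  'reduce' is a hypothetical wrapper, not the actual code.  */
__attribute__ ((target ("prefer-vector-width=128")))
static double
reduce (const double *r)
{
  double t = 0;
  for (int i = 0; i < 6; i++)
    t += r[5 - i];
  return t;
}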
Note that the actual code seems to have another loop touching r in between:
for (i = 0; i < 6; i++)
  {
    r[i] = x1 * toverp[k+i] * gor.x;
    gor.x *= tm24.x;
  }
for (i = 0; i < 3; i++)
  {
    s = (r[i] + big.x) - big.x;
    sum += s;
    r[i] -= s;
  }
t = 0;
for (i = 0; i < 6; i++)
  t += r[5-i];
Note that placing a "real" VN after vectorization fixes things as well.
The issue is that we end up with
MEM[(double *)&r] = vect__3.15_64;
...
_3 = r[1];
t_17 = _3 + t_12;
_6 = r[0];
t_28 = _6 + t_17;
return t_28;
where both loads could be CSEd via vector extracts.
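In intrinsics terms the desired transform looks roughly like this (a
hand-written sketch, not GCC output; 'vect' stands in for the low half of
vect__3.15_64 and the function name is made up):

#include <immintrin.h>

/* Instead of reloading r[1] and r[0] from memory after the vector
   store, extract the lanes from the vector register that was stored.  */
static double
sum_low_lanes (__m128d vect, double t)
{
  double r1 = _mm_cvtsd_f64 (_mm_unpackhi_pd (vect, vect)); /* lane 1 */
  double r0 = _mm_cvtsd_f64 (vect);                         /* lane 0 */
  return (t + r1) + r0;
}

Adding an FRE pass right after vectorization achieves this: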
@@ -306,6 +306,7 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_simduid_cleanup);
       NEXT_PASS (pass_lower_vector_ssa);
       NEXT_PASS (pass_lower_switch);
+      NEXT_PASS (pass_fre);
       NEXT_PASS (pass_cse_reciprocals);
       NEXT_PASS (pass_sprintf_length, true);
       NEXT_PASS (pass_reassoc, false /* insert_powi_p */);
produces
MEM[(double *)&r] = vect__3.15_64;
...
_75 = BIT_FIELD_REF <vect__3.15_64, 64, 64>;
t_17 = t_12 + _75;
_74 = BIT_FIELD_REF <vect__3.15_64, 64, 0>;
t_28 = t_17 + _74;
return t_28;
and
loop:
.LFB0:
.cfi_startproc
movslq %edi, %rax
vmovapd %xmm0, %xmm1
addl $4, %edi
movslq %edi, %rdi
vbroadcastsd %xmm0, %ymm0
vmovddup %xmm1, %xmm1
vmulpd a(,%rax,8), %ymm0, %ymm0
vmulpd a(,%rdi,8), %xmm1, %xmm1
vmovupd %ymm0, r(%rip)
vmovups %xmm1, r+32(%rip)
vmovupd r+16(%rip), %ymm2
vextractf128 $0x1, %ymm2, %xmm3
vunpckhpd %xmm3, %xmm3, %xmm1
vaddsd .LC0(%rip), %xmm1, %xmm1
vaddsd %xmm3, %xmm1, %xmm1
vunpckhpd %xmm2, %xmm2, %xmm3
vaddsd %xmm3, %xmm1, %xmm1
vaddsd %xmm2, %xmm1, %xmm1
vunpckhpd %xmm0, %xmm0, %xmm2
vaddsd %xmm1, %xmm2, %xmm1
vaddsd %xmm1, %xmm0, %xmm0
vzeroupper
ret
but another FRE pass was thought to be too costly, and we cannot simply
replace the nearby DOM pass that performs VN, since that pass also performs
jump threading, which we would then miss.
I have not assessed the compile-time cost or other fallout of adding a new
FRE pass after the VN rewrite (we could use a non-iterating FRE here).
I did assess it at some point in the past, but it was not an obvious
improvement (no SPEC improvements IIRC, but a ~2-3% compile-time slowdown).
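For reference, the non-iterating variant would change the added line in the
patch above to something like the following (a sketch, assuming pass_fre's
may_iterate parameter from the VN rewrite):

+      NEXT_PASS (pass_fre, false /* may_iterate */);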