https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90579
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
A workaround on your side would be to stick -mprefer-vector-width=128 on the
function.  Note the actual code seems to have another loop touching r in
between:

  for (i=0;i<6;i++)
    {
      r[i] = x1*toverp[k+i]*gor.x;
      gor.x *= tm24.x;
    }
  for (i=0;i<3;i++)
    {
      s = (r[i]+big.x)-big.x;
      sum += s;
      r[i] -= s;
    }
  t = 0;
  for (i=0;i<6;i++)
    t += r[5-i];

Note that placing a "real" VN pass after vectorization fixes things as well.
The issue is that we end up with

  MEM[(double *)&r] = vect__3.15_64;
...
  _3 = r[1];
  t_17 = _3 + t_12;
  _6 = r[0];
  t_28 = _6 + t_17;
  return t_28;

where both loads could be CSEd via vector extracts.

@@ -306,6 +306,7 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_simduid_cleanup);
       NEXT_PASS (pass_lower_vector_ssa);
       NEXT_PASS (pass_lower_switch);
+      NEXT_PASS (pass_fre);
       NEXT_PASS (pass_cse_reciprocals);
       NEXT_PASS (pass_sprintf_length, true);
       NEXT_PASS (pass_reassoc, false /* insert_powi_p */);

produces

  MEM[(double *)&r] = vect__3.15_64;
...
  _75 = BIT_FIELD_REF <vect__3.15_64, 64, 64>;
  t_17 = t_12 + _75;
  _74 = BIT_FIELD_REF <vect__3.15_64, 64, 0>;
  t_28 = t_17 + _74;
  return t_28;

and the following assembly:

loop:
.LFB0:
        .cfi_startproc
        movslq  %edi, %rax
        vmovapd %xmm0, %xmm1
        addl    $4, %edi
        movslq  %edi, %rdi
        vbroadcastsd    %xmm0, %ymm0
        vmovddup        %xmm1, %xmm1
        vmulpd  a(,%rax,8), %ymm0, %ymm0
        vmulpd  a(,%rdi,8), %xmm1, %xmm1
        vmovupd %ymm0, r(%rip)
        vmovups %xmm1, r+32(%rip)
        vmovupd r+16(%rip), %ymm2
        vextractf128    $0x1, %ymm2, %xmm3
        vunpckhpd       %xmm3, %xmm3, %xmm1
        vaddsd  .LC0(%rip), %xmm1, %xmm1
        vaddsd  %xmm3, %xmm1, %xmm1
        vunpckhpd       %xmm2, %xmm2, %xmm3
        vaddsd  %xmm3, %xmm1, %xmm1
        vaddsd  %xmm2, %xmm1, %xmm1
        vunpckhpd       %xmm0, %xmm0, %xmm2
        vaddsd  %xmm1, %xmm2, %xmm1
        vaddsd  %xmm1, %xmm0, %xmm0
        vzeroupper
        ret

But another FRE pass was thought to be too costly, since we cannot simply
replace the DOM pass that performs VN around this place (that pass also
performs jump threading, which we would then miss).

I have not assessed the compile-time cost or other fallout of adding a new
FRE pass after the VN rewrite (we could use a non-iterating FRE here).  I
did assess it at some point in the past, but it was not an obvious
improvement (no SPEC improvements IIRC, but a ~2-3% compile-time slowdown).
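
As an aside, a minimal sketch of the workaround at the source level: where
supported, the x86 "target" function attribute is the per-function spelling
of -mprefer-vector-width=128.  The function name and body below are made up
for illustration, not the reporter's code:

/* Sketch only: cap the vectorizer at 128-bit vectors for this one
   function; the rest of the translation unit keeps the default
   vector-width preference.  */
__attribute__ ((target ("prefer-vector-width=128")))
static double
reduce_r (const double *a)
{
  double r[6], t = 0.0;
  for (int i = 0; i < 6; i++)
    r[i] = a[i] * 2.0;
  for (int i = 0; i < 6; i++)
    t += r[5 - i];
  return t;
}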
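
To make the "CSEd via vector extracts" point concrete, here is a hand-written
intrinsics sketch (illustration only, not compiler output) of what the
BIT_FIELD_REF form amounts to: r[1] and r[0] are taken from the still-live
vector register instead of being re-loaded from the stack slot after the
vector store:

#include <immintrin.h>

/* Take lanes 1 and 0 of the just-stored vector and accumulate them,
   mirroring the two BIT_FIELD_REFs in the dump above.  */
static double
tail_sum (__m256d v, double t)
{
  __m128d lo = _mm256_castpd256_pd128 (v);        /* lanes 0 and 1 */
  t += _mm_cvtsd_f64 (_mm_unpackhi_pd (lo, lo));  /* r[1] */
  t += _mm_cvtsd_f64 (lo);                        /* r[0] */
  return t;
}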
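
For reference, pass_fre is parameterized on whether it iterates, so the
non-iterating variant suggested above would amount to a hunk like this
against gcc/passes.def (sketch, assuming the post-rewrite pass_fre
parameterization):

       NEXT_PASS (pass_lower_switch);
+      NEXT_PASS (pass_fre, false /* may_iterate */);
       NEXT_PASS (pass_cse_reciprocals);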