https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70138
--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I've already spent some time on this last night. It fails even when foo is not inlined:

double u[1782225];

__attribute__((noinline, noclone)) static void
foo (int *x)
{
  double c = 0.0;
  int a, b;
  for (a = 0; a < 1335; a++)
    {
      for (b = 0; b < 1335; b++)
        c = c + u[1336 * a];
      u[1336 * a] *= 2.0;
    }
  *x = c;
}

int
main ()
{
  int d, e;
  for (d = 0; d < 1782225; d++)
    u[d] = 2.0;
  foo (&e);
  if (e != 3564450)
    __builtin_abort ();
  return 0;
}

This is outer loop vectorization, which processes two outer loop iterations at the same time. What I find wrong is the place where we reduce the result. There we emit:

  <bb 12>:
  # c_42 = PHI <c_22(6)>
  # a_44 = PHI <a_17(6)>
  # ivtmp_47 = PHI <ivtmp_3(6)>
  # vect_c_10.8_64 = PHI <vect_c_10.8_63(6)>
  # pretmp_82 = PHI <pretmp_28(6)>
  vect_c_10.11_65 = VEC_PERM_EXPR <vect_c_10.8_64, { 0.0, 0.0 }, { 1, 2 }>;
  vect_c_10.11_66 = vect_c_10.11_65 + vect_c_10.8_64;
  stmp_c_10.10_67 = BIT_FIELD_REF <vect_c_10.11_66, 64, 0>;
  tmp.5_53 = pretmp_82 * 1.78089e+6;
  goto <bb 7>;

vect_c_10.8_63 holds here the partial sums of c from the first 2 * 667 iterations of the outer loop (low element from even and high element from odd iterations). So, I'd very much expect that stmp_c_10.10_67 would be what is used in the PHI of the final scalar loop, but the stmp_c_10.10_67 SSA_NAME is actually unused and instead
  tmp.5_53 = pretmp_82 * 1.78089e+6;
computes something weird (pretmp_82 at this point is u[1336 * (667 - 1)]).

It is true that some optimization for -Ofast could/should figure out that
  for (b = 0; b < 1335; b++)
    c = c + u[1336 * a];
is actually
  c = c + u[1336 * a] * 1335.0;
but that clearly hadn't happened by the time of vectorization, and while in the testcase all u elements are initially equal, that is not a given in general.
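To illustrate that last remark, here is a minimal standalone sketch (not GCC code, and not what the vectorizer emits). The names N, STRIDE, u_model, sum_nested, sum_collapsed, sum_one_element, fill and check_collapse are made up for this illustration. It uses small integer values so that all double arithmetic is exact, and it leaves out the u[1336 * a] *= 2.0 store, which does not affect c. It compares the nested loop, the per-iteration collapsed form, and a single element scaled by the total trip count:

```c
#include <assert.h>
#include <stddef.h>

#define N 1335       /* trip count of both loops in the testcase */
#define STRIDE 1336  /* index stride, as in u[1336 * a] */

/* Same size as u in the testcase (1335 * 1335 == 1782225). */
static double u_model[N * N];

/* The loop nest from foo, without the store back to u. */
static double sum_nested(const double *u)
{
    double c = 0.0;
    for (int a = 0; a < N; a++)
        for (int b = 0; b < N; b++)
            c = c + u[(size_t)STRIDE * a];
    return c;
}

/* Inner loop collapsed into one multiply per outer iteration. */
static double sum_collapsed(const double *u)
{
    double c = 0.0;
    for (int a = 0; a < N; a++)
        c = c + u[(size_t)STRIDE * a] * (double)N;
    return c;
}

/* Hypothetical shortcut: one element scaled by the total trip count.
   Only equal to the sum when every element that is read is the same. */
static double sum_one_element(const double *u, int a)
{
    return u[(size_t)STRIDE * a] * (double)N * (double)N;
}

/* Fill the elements that the loops read: either all 2.0 (as in the
   testcase) or small varying integers. */
static void fill(double *u, int varied)
{
    for (int a = 0; a < N; a++)
        u[(size_t)STRIDE * a] = varied ? (double)(a % 7 + 1) : 2.0;
}

/* Self-check of the claims made in the surrounding text. */
void check_collapse(void)
{
    fill(u_model, 0);
    assert(sum_nested(u_model) == 3564450.0);          /* value main expects */
    assert(sum_one_element(u_model, 0) == 3564450.0);  /* coincides here */

    fill(u_model, 1);
    assert(sum_nested(u_model) == sum_collapsed(u_model));
    assert(sum_one_element(u_model, 0) != sum_nested(u_model));
}
```

With uniform input all three forms agree, which is why a single scaled element can look plausible on this testcase. With varying input only the per-iteration collapsed form matches the nested loop.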
As a proof, I've tried to modify in the debugger the value of u[1336 * (667 - 1)] before the call, setting it to 6.0 and to 32.0, and the final value of e has been roughly scaled accordingly (as the vectorized loop doubles all u elements with indexes that are a multiple of 1336, at that point the value is twice the original value, and we then multiply it by 1780890, the 1.78089e+6 in the dump, and add the sum from the trailing scalar outer loop).
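For reference, the three reduction statements in <bb 12> amount to a horizontal add of the two lanes. The following is a scalar model only, under the assumption that lane 0 is the low element and that VEC_PERM_EXPR selects from the concatenation of its two input vectors. The names v2df_model, vec_perm, reduce_add and check_reduce are invented for this sketch and are not GCC internals:

```c
#include <assert.h>

/* Two-lane vector of doubles modeled as a plain struct.
   Assumption: lane[0] is the low element. */
typedef struct { double lane[2]; } v2df_model;

/* Model of VEC_PERM_EXPR <a, b, { sel0, sel1 }>: each selector indexes
   into the concatenation a[0], a[1], b[0], b[1]. */
static v2df_model vec_perm(v2df_model a, v2df_model b, int sel0, int sel1)
{
    double concat[4] = { a.lane[0], a.lane[1], b.lane[0], b.lane[1] };
    v2df_model r = { { concat[sel0 & 3], concat[sel1 & 3] } };
    return r;
}

/* Model of the epilogue: permute, add, extract the low 64 bits. */
static double reduce_add(v2df_model acc)
{
    v2df_model zero = { { 0.0, 0.0 } };
    /* { acc[1], 0.0 } */
    v2df_model shifted = vec_perm(acc, zero, 1, 2);
    /* { acc[1] + acc[0], 0.0 + acc[1] } */
    v2df_model sum = { { shifted.lane[0] + acc.lane[0],
                         shifted.lane[1] + acc.lane[1] } };
    /* BIT_FIELD_REF <sum, 64, 0> corresponds to lane 0 in this model. */
    return sum.lane[0];
}

/* Self-check with the numbers from the testcase: each lane accumulates
   667 outer iterations of 1335 additions of 2.0, and the trailing scalar
   outer iteration adds another 1335 * 2.0. */
void check_reduce(void)
{
    v2df_model acc = { { 667 * 1335 * 2.0, 667 * 1335 * 2.0 } };
    assert(reduce_add(acc) == 3561780.0);
    assert(reduce_add(acc) + 1335 * 2.0 == 3564450.0);
}
```

Under these assumptions the reduced value plus the trailing scalar iteration gives exactly the 3564450 that main checks for, which is consistent with the expectation that stmp_c_10.10_67 is the value that should feed the final scalar loop.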