https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70138
--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Further improved testcase (just decrease number of iterations somewhat, and
make sure the u elements that are summed are different in each outer loop
iteration, to verify the vectorizer doesn't just multiply value from some
iteration by the number of iterations):

double u[333 * 333];

__attribute__((noinline, noclone)) static void
foo (int *x)
{
  double c = 0.0;
  int a, b;
  for (a = 0; a < 333; a++)
    {
      for (b = 0; b < 333; b++)
        c = c + u[334 * a];
      u[334 * a] *= 2.0;
    }
  *x = c;
}

int
main ()
{
  int d, e;
  for (d = 0; d < 333 * 333; d++)
    u[d] = 499.0;
  for (d = 0; d < 333; d++)
    u[d * 334] = (d + 2);
  foo (&e);
  if (e != 333 * (2 + 334) / 2 * 333)
    __builtin_abort ();
  return 0;
}

BTW, I'm really surprised we vectorize this even without -Ofast, it is a
double reduction, therefore reducing it causes different floating point
operations between the vectorized and non-vectorized cases.
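
To illustrate the last point, here is a minimal sketch (not taken from the PR) of why the order of a double reduction matters in general. The helper names sum_in_order and sum_two_lanes and the sample array are hypothetical; sum_two_lanes only imitates by hand the kind of reassociation a 2-lane vectorized reduction would perform, it is not what the vectorizer actually emits. Note that the testcase above only sums small integers, so there both orders happen to be exact.

```c
#include <assert.h>

/* Hypothetical sample data: one huge value, its negation, and two ones.  */
static const double sample[4] = { 1e30, 1.0, -1e30, 1.0 };

/* Sum strictly in source order, the way the scalar loop does.  */
static double
sum_in_order (const double *v, int n)
{
  double s = 0.0;
  int i;
  assert (n >= 0);
  for (i = 0; i < n; i++)
    s += v[i];
  return s;
}

/* Sum with two independent partial accumulators (even and odd indexes)
   that are combined at the end.  This is an assumption about the shape
   of a 2-lane reduction, written out by hand for illustration.  */
static double
sum_two_lanes (const double *v, int n)
{
  double lane[2] = { 0.0, 0.0 };
  double s;
  int i;
  assert (n >= 0);
  for (i = 0; i + 1 < n; i += 2)
    {
      lane[0] += v[i];
      lane[1] += v[i + 1];
    }
  s = lane[0] + lane[1];
  /* Scalar epilogue for a leftover element, if n is odd.  */
  for (; i < n; i++)
    s += v[i];
  return s;
}
```

On the sample array, the in-order sum loses the first 1.0 when it is added to 1e30 and so ends up as 1.0, while the two-lane sum cancels the huge values in one lane, adds the ones in the other, and ends up as 2.0. Both are valid results of some association of the same additions, which is why such reassociation is normally expected only under -Ofast or -ffast-math.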