https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105968
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |WONTFIX

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
> ./cc1 -quiet t.c -O3 -mavx2 -fopt-info
t.c:11:25: optimized: loops interchanged in loop nest
> ./cc1 -quiet t.c -O2 -mavx2 -fopt-info
t.c:14:19: optimized: loop vectorized using 32 byte vectors

so we interchange the loop to

  for (i = 0; i < N; ++i)
    for (times = 0; times < NTIMES; times++)
      r[i] = (a[i] + b[i]) * c[i];

which is indeed good for memory locality (now, we should then eliminate the
inner loop completely but we have no such facility - only unrolling and
DSE/DCE would do this but nothing on the high-level loop form).

"Benchmark" issue.  The outer loop should have a memory clobber.

Oh, and we should in theory be able to vectorize the outer loop if N is a
multiple of the vector element count.  But:

t.c:11:25: note: === vect_analyze_data_ref_accesses ===
t.c:11:25: note: zero step in inner loop of nest
t.c:11:25: missed: not vectorized: complicated access pattern.
t.c:15:14: missed: not vectorized: complicated access pattern.
t.c:11:25: missed: bad data access.

so we don't handle this exact issue (maybe the offending check can simply be
elided - assuming dependence checking handles zero steps correctly).

Putting

  __asm__ volatile ("" : : : "memory");

at the end of the outer loop vectorizes with -O3 as well (but doesn't
interchange).

Not a bug I think unless you want to make it a bug about not vectorizing the
outer loop after interchange.
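
For illustration, the suggested workaround might look roughly like the sketch below. This is not the reporter's t.c (which is not reproduced in this comment): the array names follow the interchanged loop quoted above, while the values of N and NTIMES and the helper names init_arrays, run_kernel and check_result are assumptions made up for the example. The empty asm statement with a "memory" clobber sits at the end of each outer (times) iteration, which is where the comment says it lets -O3 vectorize the inner loop without interchanging the nest.

```c
#include <assert.h>
#include <stddef.h>

#define N      1024   /* assumed array length, not taken from the report */
#define NTIMES 100    /* assumed repetition count, not taken from the report */

float a[N], b[N], c[N], r[N];

/* Fill the inputs with small integers so every result is exact in float. */
void init_arrays(void)
{
    for (size_t i = 0; i < N; ++i) {
        a[i] = (float)i;
        b[i] = 1.0f;
        c[i] = 2.0f;
        r[i] = 0.0f;
    }
}

/* Benchmark-style loop nest: the same kernel is repeated NTIMES times. */
void run_kernel(void)
{
    for (int times = 0; times < NTIMES; times++) {
        for (size_t i = 0; i < N; ++i)
            r[i] = (a[i] + b[i]) * c[i];

        /* Compiler barrier: the "memory" clobber tells GCC that memory may
           have been read or written here, so the stores to r[] in one
           repetition cannot be treated as dead or merged with the next. */
        __asm__ volatile ("" : : : "memory");
    }
}

/* Sanity check: every element holds (a[i] + b[i]) * c[i]. */
void check_result(void)
{
    for (size_t i = 0; i < N; ++i)
        assert(r[i] == (a[i] + b[i]) * c[i]);
}
```

Calling init_arrays(), run_kernel() and then check_result() exercises the nest. Building with -O3 -mavx2 -fopt-info, as in the commands above, should show which loops GCC reports as vectorized or interchanged, though the exact messages and line numbers will differ from the ones quoted for t.c.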