https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625
--- Comment #4 from Chris Elrod <elrodc at gmail dot com> ---
Created attachment 44423
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44423&action=edit
8x16 * 16x6 kernel for avx2.

Here is a scaled-down version that reproduces most of the problem on avx2-capable architectures. I just used -march=haswell, but I think most recent architectures fall under this. For some, like znver1, you may need to add -mprefer-vector-width=256.

To get the inefficiently vectorized loop:

  gfortran -march=haswell -Ofast -shared -fPIC -S kernelsavx2.f90 -o kernelsavx2bad.s

To get only the unnecessary loads/stores, use:

  gfortran -march=haswell -O2 -ftree-vectorize -shared -fPIC -S kernelsavx2.f90 -o kernelsavx2.s

This file compiles instantly, while with `-O3` the other one can take a couple of seconds. However, while it does `vmovapd` between registers, the manually unrolled version no longer spills to the stack the way the avx512 kernel does.
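For anyone comparing the two outputs, a rough way to quantify the difference is to count `vmovapd` instructions in each generated file. The following is only a sketch: the helper name `count_moves` is hypothetical, and it assumes the default AT&T syntax that gfortran emits on x86-64 and that spills show up as `vmovapd` with an `%rsp`- or `%rbp`-relative operand. It will miss spills done with other move instructions.

```shell
#!/bin/sh
# count_moves FILE (hypothetical helper): print how many vmovapd instructions
# are register-to-register and how many touch a stack-relative address.
count_moves() {
    file=$1
    # Both operands are xmm/ymm/zmm registers, e.g. "vmovapd %ymm1, %ymm0".
    reg=$(grep -E -c 'vmovapd[[:space:]]+%[xyz]mm[0-9]+,[[:space:]]*%[xyz]mm[0-9]+' "$file" || true)
    # One operand is addressed off %rsp or %rbp, e.g. "vmovapd %ymm0, 32(%rsp)".
    stack=$(grep -E -c 'vmovapd.*\(%r[sb]p' "$file" || true)
    echo "reg_to_reg=$reg stack=$stack"
}

# Report on every assembly file given on the command line (none is fine).
for f in "$@"; do
    echo "$f: $(count_moves "$f")"
done
```

Saved as, say, countmoves.sh (an arbitrary name), it could be run as `sh countmoves.sh kernelsavx2bad.s kernelsavx2.s` after the two gfortran commands above.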