3x assembly and running 10x slower than manual complete unrolling

elrodc at gmail dot com Mon, 23 Jul 2018 07:09:35 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625


--- Comment #4 from Chris Elrod <elrodc at gmail dot com> ---
Created attachment 44423
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44423&action=edit
8x16 * 16x6 kernel for avx2.

Here is a scaled-down version that reproduces most of the problem on
avx2-capable architectures.
I just used -march=haswell, but I think most recent architectures behave the
same way.
For some, like znver1, you may need to add -mprefer-vector-width=256.


To get the inefficiently vectorized loop:

gfortran -march=haswell -Ofast -shared -fPIC -S kernelsavx2.f90 -o kernelsavx2bad.s

To get only the unnecessary loads/stores, use:

gfortran -march=haswell -O2 -ftree-vectorize -shared -fPIC -S kernelsavx2.f90 -o kernelsavx2.s
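
A rough way to compare the size of the two outputs, sketched as a small shell
helper. The name asm_insn_count is made up for illustration, and the pattern is
only a heuristic for GCC's AT&T-syntax listings, not an exact instruction
count:

```shell
# asm_insn_count FILE: approximate number of instruction lines in FILE.
# Heuristic (an assumption about the listing layout): instructions are
# indented and start with a lowercase letter, directives start with '.',
# and labels are not indented.
asm_insn_count() {
    grep -cE '^[[:space:]]+[a-z]' "$1"
}

# Example, using the file names produced by the two commands above:
#   asm_insn_count kernelsavx2bad.s
#   asm_insn_count kernelsavx2.s
```

Dividing the first count by the second gives a quick estimate of how much extra
assembly the -Ofast build emits.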

This file compiles instantly, while with `-O3` the other one can take a couple
of seconds.
However, while it still does `vmovapd` between registers, the manually unrolled
version no longer spills to the stack the way the avx512 kernel does.
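
To check for the register-to-register moves and the stack traffic in a
generated .s file, something like the following could be used. Both helper
names are hypothetical, and the patterns assume AT&T syntax with spills
addressed through %rsp or %rbp, so treat the counts as approximate:

```shell
# count_reg_moves FILE: vmovapd with a vector register as both source and
# destination (AT&T syntax, e.g. "vmovapd %ymm1, %ymm2").
count_reg_moves() {
    grep -cE 'vmovapd[[:space:]]+%[xyz]mm[0-9]+,[[:space:]]*%[xyz]mm[0-9]+' "$1"
}

# count_stack_traffic FILE: packed-double moves whose memory operand is
# based on the stack or frame pointer (assumed to indicate spills/reloads).
count_stack_traffic() {
    grep -cE 'vmov[au]pd[[:space:]].*\(%r[sb]p\)' "$1"
}

# Example:
#   count_reg_moves kernelsavx2.s
#   count_stack_traffic kernelsavx2.s
```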

[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling
