https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80232

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
On a Broadwell CPU, with -Ofast -march=native [-mno-avx], I get:

Sparse matmult  Mflops:  2481.77    (N=1000, nz=5000)  -mno-avx
Sparse matmult  Mflops:  2043.19    (N=1000, nz=5000)
Sparse matmult  Mflops:  2248.71    (N=100000, nz=1000000)  -mno-avx
Sparse matmult  Mflops:  1664.08    (N=100000, nz=1000000)

For the small system the slowdown is the overhead of not taking the vectorized
code path at runtime, while for the large system it is the overhead of taking
the vectorized code path.  With -mno-avx we do not vectorize the loop at all.
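
For reference, a minimal sketch of the kind of loop being discussed, assuming
the benchmark kernel is a compressed-row sparse matrix-vector product.  The
function and parameter names (csr_matvec, row_start, col, val) are
illustrative and not taken from the benchmark source.  The indexed load
x[col[i]] is what a vectorizer has to turn into a gather, and the inner trip
count is the number of nonzeros in the row: if the nonzeros are spread evenly
(an assumption), that is only about nz/N = 5 iterations for the small system
and 10 for the large one.

```c
#include <stddef.h>

/* Hypothetical sketch: y = A * x with A in compressed-row storage.
 * row_start has n_rows + 1 entries; the nonzeros of row r are
 * val[row_start[r]] .. val[row_start[r + 1] - 1], and col[] holds the
 * column index of each nonzero. */
void csr_matvec(size_t n_rows, const int *row_start, const int *col,
                const double *val, const double *x, double *y)
{
    for (size_t r = 0; r < n_rows; r++) {
        double sum = 0.0;
        /* Short inner loop: one iteration per nonzero in row r. */
        for (int i = row_start[r]; i < row_start[r + 1]; i++)
            sum += x[col[i]] * val[i];  /* x[col[i]] is an indexed (gather) load */
        y[r] = sum;
    }
}
```

With so few iterations per row, any fixed per-row cost of a vectorized version
(runtime checks, reduction epilogue) is a plausible source of overhead on top
of the gather cost itself.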

Note that Broadwell does not yet reach optimal latency/throughput for gathers
(2 lanes / cycle, saturating the two load ports).  I don't have a Skylake
machine for comparison, though.
