https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80232
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
I have (Broadwell CPU) with -Ofast -march=native [-mno-avx]

Sparse matmult  Mflops:  2481.77    (N=1000, nz=5000)  -mno-avx
Sparse matmult  Mflops:  2043.19    (N=1000, nz=5000)
Sparse matmult  Mflops:  2248.71    (N=100000, nz=1000000)  -mno-avx
Sparse matmult  Mflops:  1664.08    (N=100000, nz=1000000)

For the small system it's the overhead when not taking the vectorized
code path at runtime, while for the large one it is the overhead when taking
the vectorized code path. With -mno-avx we are not vectorizing the loop.

Note Broadwell does not yet reach optimal latency/throughput for gathers
(2 lanes / cycle, saturating the two load ports). I don't have a Skylake
machine for comparison, though.
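
For readers without the benchmark source at hand, the loop in question is presumably of the following shape: a sparse matrix-vector product over a compressed-row matrix. This is only a sketch under that assumption; the function and parameter names (csr_matvec, row_ptr, col, val) are hypothetical and not taken from the benchmark. The indexed load x[col[i]] is the access that a vectorizer has to implement as a gather, and the inner trip count is the number of nonzeros in one row (about nz/N, i.e. roughly 5 and 10 for the two problem sizes quoted, if the nonzeros are spread evenly), so the vector body has few iterations over which to amortize its setup cost.

```c
/* Sketch of a CSR (compressed sparse row) matrix-vector product, y = A*x.
   All names are hypothetical; this is not the benchmark's actual code. */
void csr_matvec(int n_rows, const int *row_ptr, const int *col,
                const double *val, const double *x, double *y)
{
    for (int r = 0; r < n_rows; r++) {
        double sum = 0.0;
        /* Inner trip count = number of nonzeros stored for row r. */
        for (int i = row_ptr[r]; i < row_ptr[r + 1]; i++)
            sum += val[i] * x[col[i]];  /* x[col[i]] is the indexed (gather) load */
        y[r] = sum;
    }
}
```

Compiling such a kernel with and without -mno-avx and comparing the generated assembly should show whether a gather instruction is emitted for the inner loop.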