https://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855
--- Comment #55 from Richard Biener <rguenth at gcc dot gnu.org> ---
I think the original report is about x87 math vs. SSE math. It's a bit hard to benchmark this across releases given changes in tuning and vector feature sets (-march=native is out of the question). So I use -O3 -ffast-math -DREPS=100000 -m32 as the base and see the following (reported MFLOPS):

ISA                    4.3.6  4.6.4  4.8.5   7.2
-mno-sse                1855   6930   4618  5623
-msse2 -mfpmath=sse     1967   6945   4744  6472
-m64                    2977   6917   4935  6205

Note I edited the benchmark and put noinline,noclone attributes on the gemm_atlas function. I benchmarked on a Broadwell system with minimal CPU frequency boosting, but varying REPS still changes the reported MFLOPS _a lot_ (individual runs are somewhat stable, though: for the last reported number, 6205, I can also get 6331 or 6186). I used the attached benchmark; the cited URL doesn't work anymore.

So there's still an apparent regression. The trunk numbers aren't very different from the 7.2 results, the 4.6.4 variant is still fastest, and we had already recovered to current levels with 4.9.4 (just checked -m64 across all releases).

With -march=native I get to new heights, obviously, because we use things like FMA, AVX, etc. If I add just -mavx, 4.6.4 is not faster than without it, but 7.2 improves to 6628, for example (4.6.4 doesn't know AVX2 and -mfma results in bogus assembler being generated...).

If I look at the generated code for -m64 (with just SSE), we no longer spill a lot in the inner loop (only once) and we don't vectorize. 4.6.4 manages to avoid any spilling in the computation (even in the outer loop). So the original analysis (RA sucks) still holds.

Note the original report used -O and Aldy used -O2, but we are talking about a benchmark, and when you use -ffast-math you also use -O3.

Note the biggest regression we still see is with x87 math; I think we can reasonably disregard that now. The benchmark is somewhat badly written (manually "optimized"), so our vectorization attempts fail.
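For reference, this is roughly what the noinline,noclone change looks like. It is only a sketch: kernel_under_test and run_reps are hypothetical names, and the loop body is a trivial stand-in, not the gemm_atlas code from the attachment. Both attributes are GCC-specific.

```c
#include <stddef.h>

/* noinline keeps the kernel from being inlined into the timing loop;
   noclone keeps GCC from creating a specialized copy for constant
   arguments.  The body is a placeholder, not the real benchmark kernel. */
__attribute__((noinline, noclone))
void kernel_under_test(double *y, const double *x, double alpha, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Hypothetical driver: calls the kernel reps times, in the way a
   REPS-controlled benchmark loop would. */
void run_reps(double *y, const double *x, double alpha, size_t n, long reps)
{
    for (long r = 0; r < reps; r++)
        kernel_under_test(y, x, alpha, n);
}
```

With the attributes in place, the kernel is compiled as a standalone function, so its code does not depend on what the compiler can see in the caller (constant sizes, REPS, and so on).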
Overall conclusion: I'm unsure whether it's worth pursuing this bug further. There is a register pressure issue left, but the testcase may not be real-world enough. That is, I would usually recommend un-obfuscating the manually optimized code first.
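To illustrate what "un-obfuscated" could mean here, below is a plain loop-nest GEMM. It is an assumption-laden sketch, not derived from the attached benchmark: gemm_plain is a hypothetical name, and it assumes square row-major matrices of doubles that do not alias.

```c
#include <stddef.h>

/* C += A * B for row-major n x n matrices, written as a plain loop nest
   with no manual unrolling or blocking.  restrict tells the compiler
   that the three arrays do not overlap. */
void gemm_plain(size_t n, const double *restrict a,
                const double *restrict b, double *restrict c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            const double aik = a[i * n + k];
            /* Unit-stride inner loop over a row of B and a row of C. */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

The i-k-j order gives a unit-stride inner loop, which is the shape that loop vectorizers generally handle best; whether a given GCC release actually vectorizes it would need to be checked.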