https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
--- Comment #30 from Jerry DeLisle <jvdelisle at gcc dot gnu.org> ---
(In reply to Joost VandeVondele from comment #29)
> These slides show how to reach 90% of peak:
> http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/
> the code actually is not too ugly, and I think there is no need for the
> explicit vector intrinsics with gcc.

The 90% of peak is achieved using SSE registers. I went ahead and built the
example, and on my laptop (the slow machine) I get about 4.8 GFLOPS with a
single core.

So we could use this example and back off from the SSE optimizations to get an
internal MATMUL that is not architecture dependent, and perhaps leave the rest
to an external optimized BLAS.
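
To make the "back off from the SSE optimizations" idea concrete, here is a
rough sketch of what an intrinsics-free, register-blocked kernel could look
like. It is not the code from the slides and not what libgfortran does. The
names kernel_4x4, matmul_blocked and ELEM are made up for illustration, the
4x4 tile size is an assumption, and there is no packing or cache blocking.
Whether gcc vectorizes the inner loops depends on the flags and the target.

```c
#include <stddef.h>

/* Column-major element access, as in Fortran: element (i,j) of a matrix
   with leading dimension ld.  ELEM is a hypothetical helper.  */
#define ELEM(m, ld, i, j) ((m)[(size_t) (j) * (ld) + (i)])

/* Hypothetical 4x4 micro-kernel: C(0:3,0:3) += A(0:3,0:k-1) * B(0:k-1,0:3).
   Plain C only; keeping the 16 accumulators in registers and vectorizing
   the loops is left to the compiler.  */
static void
kernel_4x4 (size_t k, const double *a, size_t lda,
            const double *b, size_t ldb, double *c, size_t ldc)
{
  double acc[4][4] = {{0.0}};

  for (size_t p = 0; p < k; p++)
    for (size_t j = 0; j < 4; j++)
      {
        double bpj = ELEM (b, ldb, p, j);
        for (size_t i = 0; i < 4; i++)
          acc[j][i] += ELEM (a, lda, i, p) * bpj;
      }

  for (size_t j = 0; j < 4; j++)
    for (size_t i = 0; i < 4; i++)
      ELEM (c, ldc, i, j) += acc[j][i];
}

/* C += A * B with A m-by-k, B k-by-n, C m-by-n, all column-major.  */
void
matmul_blocked (size_t m, size_t n, size_t k,
                const double *a, size_t lda,
                const double *b, size_t ldb,
                double *c, size_t ldc)
{
  size_t m4 = m - m % 4, n4 = n - n % 4;

  /* Full 4x4 tiles.  */
  for (size_t j = 0; j < n4; j += 4)
    for (size_t i = 0; i < m4; i += 4)
      kernel_4x4 (k, &ELEM (a, lda, i, 0), lda,
                  &ELEM (b, ldb, 0, j), ldb,
                  &ELEM (c, ldc, i, j), ldc);

  /* Fringe: scalar loops for the rows and columns not covered by a tile.  */
  for (size_t j = 0; j < n; j++)
    for (size_t i = (j < n4 ? m4 : 0); i < m; i++)
      {
        double s = 0.0;
        for (size_t p = 0; p < k; p++)
          s += ELEM (a, lda, i, p) * ELEM (b, ldb, p, j);
        ELEM (c, ldc, i, j) += s;
      }
}
```

A call such as matmul_blocked (m, n, k, a, m, b, k, c, m) accumulates into c
for contiguous column-major arrays. Something of this shape stays
architecture independent, and anyone who needs the last bit of performance
can still link an external optimized BLAS.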