https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600
--- Comment #8 from Jerry DeLisle <jvdelisle at gcc dot gnu.org> ---
Created attachment 36887
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36887&action=edit
A faster version

I took the example code found in
http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/, where the register-based
vector computations are called explicitly via the SSE registers, and converted
it to use GCC's built-in vector extensions. I had to experiment a little to
find equivalents for some of the operations in the original code.

With only -O2 and -march=native I am getting good results: roughly 3.9 to 4.0
Gflops once the size reaches 120. I still need to roll this into the other
test program to confirm that the Gflops are being computed correctly. The Diff
value compares against the naive reference results to check that the
computation is correct.

MY_MMult = [
Size: 40, Gflops: 1.828571e+00, Diff: 2.664535e-15
Size: 80, Gflops: 3.696751e+00, Diff: 7.105427e-15
Size: 120, Gflops: 4.051583e+00, Diff: 1.065814e-14
Size: 160, Gflops: 4.015686e+00, Diff: 1.421085e-14
Size: 200, Gflops: 4.029212e+00, Diff: 2.131628e-14
Size: 240, Gflops: 3.972414e+00, Diff: 2.486900e-14
Size: 280, Gflops: 3.881188e+00, Diff: 2.842171e-14
Size: 320, Gflops: 3.872371e+00, Diff: 3.552714e-14
Size: 360, Gflops: 3.887676e+00, Diff: 4.973799e-14
Size: 400, Gflops: 3.862052e+00, Diff: 4.973799e-14
Size: 440, Gflops: 3.886575e+00, Diff: 4.973799e-14
Size: 480, Gflops: 3.910124e+00, Diff: 6.039613e-14
Size: 520, Gflops: 3.863706e+00, Diff: 6.394885e-14
Size: 560, Gflops: 3.976947e+00, Diff: 6.750156e-14
Size: 600, Gflops: 4.002631e+00, Diff: 7.460699e-14
Size: 640, Gflops: 3.992507e+00, Diff: 8.171241e-14
Size: 680, Gflops: 3.964570e+00, Diff: 9.237056e-14
Size: 720, Gflops: 3.973661e+00, Diff: 1.101341e-13
Size: 760, Gflops: 3.982346e+00, Diff: 1.065814e-13
Size: 800, Gflops: 3.869291e+00, Diff: 8.881784e-14
Size: 840, Gflops: 3.936271e+00, Diff: 1.065814e-13
Size: 880, Gflops: 3.931259e+00, Diff: 1.030287e-13
Size: 920, Gflops: 3.912907e+00, Diff: 1.207923e-13
Size: 960, Gflops: 3.938391e+00, Diff: 1.278977e-13
Size: 1000, Gflops: 3.945754e+00, Diff: 1.421085e-13
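
The attachment itself is not reproduced in this message, so the following is only a minimal sketch of the general idea, not the attached code. It shows a two-wide double vector type declared with GCC's vector_size attribute, a naive reference product, a vectorized product that updates two rows of C at a time, and helpers for the two reported columns. All names here (v2df, ref_mmult, vec_mmult, max_diff, gflops) are hypothetical. Column-major storage, square matrices with even n, Diff as the largest absolute element difference, and a flop count of 2*n^3 are assumptions made for the illustration.

```c
#include <assert.h>
#include <string.h>

/* Two doubles packed into one 16-byte vector (GCC/Clang vector extension). */
typedef double v2df __attribute__((vector_size(16)));

/* Naive reference: C += A * B, all n x n, column-major (assumed layout). */
void ref_mmult(int n, const double *a, const double *b, double *c)
{
    for (int j = 0; j < n; j++)
        for (int p = 0; p < n; p++)
            for (int i = 0; i < n; i++)
                c[i + j * n] += a[i + p * n] * b[p + j * n];
}

/* Same product, computing two adjacent rows of one column of C per vector.
   Hypothetical kernel for illustration; n must be even. */
void vec_mmult(int n, const double *a, const double *b, double *c)
{
    assert(n % 2 == 0);
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i += 2) {
            v2df acc;
            /* memcpy keeps the load valid for unaligned data. */
            memcpy(&acc, &c[i + j * n], sizeof acc);
            for (int p = 0; p < n; p++) {
                v2df av;
                memcpy(&av, &a[i + p * n], sizeof av);
                double s = b[p + j * n];
                v2df bv = { s, s };   /* broadcast one element of B */
                acc += av * bv;       /* element-wise multiply and add */
            }
            memcpy(&c[i + j * n], &acc, sizeof acc);
        }
    }
}

/* Largest absolute element difference between two n x n matrices
   (assumed meaning of the Diff column). */
double max_diff(int n, const double *x, const double *y)
{
    double d = 0.0;
    for (int i = 0; i < n * n; i++) {
        double e = x[i] - y[i];
        if (e < 0.0)
            e = -e;
        if (e > d)
            d = e;
    }
    return d;
}

/* Gflops for an n x n product, counting 2*n^3 floating-point operations
   (assumed convention). */
double gflops(int n, double seconds)
{
    return 2.0 * n * n * n * 1e-9 / seconds;
}
```

A caller would zero two result matrices, run ref_mmult into one and vec_mmult into the other, and report max_diff between them along with gflops for the measured time. Built with something like gcc -O2 -march=native, the arithmetic on v2df is left to the compiler to map onto whatever vector registers the target provides, which is the point of using the extension instead of explicit SSE intrinsics.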