https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
--- Comment #38 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
(In reply to Joost VandeVondele from comment #37)
> (In reply to Joost VandeVondele from comment #36)
> > #pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller
> > -funroll-loops" )
>
> and really beneficial for larger matrices would be
>
> -floop-nest-optimize
>
> in particular the blocking (it would be an additional motivation for PR14741
> and work on graphite in general), don't know if one can give the parameter
> for the blocking. In principle the loop-nest-optimization, together with the
> -Ofast (and ideally -march=native, which we can't have in libgfortran, I
> assume) would yield near peak performance.

The algorithm that Jerry implemented already has a very nice
unrolling/blocking algorithm. I doubt that the gcc algorithms
can add to that.

Regarding -march=native, that could really be an improvement, especially
with -mavx. I wonder if it is possible to have architecture-specific
versions of library functions? We could select the right routine
depending on the -march flag. Worth a question on the gcc list, probably
(but definitely _not_ a prerequisite for this going into gcc 7).

Of course, we _could_ also try to bring blocking to the inline
version (PR 66189), risking insanity for the implementer :-)

Jerry, what Netlib code were you basing your code on?
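
For readers who have not seen the technique, here is a minimal sketch in C of
what "blocking" means for a matrix product. It is not Jerry's implementation
and not the Netlib code asked about above. The name matmul_blocked and the
tile size BLOCK are hypothetical, chosen only for illustration, and the
manual unrolling that the real routine also does is left out. Matrices are
stored column-major, as in Fortran, and the caller is assumed to have zeroed
the result first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile size; a real library would tune it to the cache. */
#define BLOCK 32

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/*
 * c (m x n) += a (m x k) * b (k x n), all column-major:
 *   a(i,l) = a[i + l*m], b(l,j) = b[l + j*k], c(i,j) = c[i + j*m].
 * The three outer loops walk over tiles; the three inner loops do an
 * ordinary product restricted to one tile, so the tile of a is reused
 * while it is still likely to be in cache.
 */
void matmul_blocked(size_t m, size_t n, size_t k,
                    const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t jj = 0; jj < n; jj += BLOCK) {
        size_t jmax = min_sz(jj + BLOCK, n);
        for (size_t ll = 0; ll < k; ll += BLOCK) {
            size_t lmax = min_sz(ll + BLOCK, k);
            for (size_t ii = 0; ii < m; ii += BLOCK) {
                size_t imax = min_sz(ii + BLOCK, m);
                for (size_t j = jj; j < jmax; j++) {
                    for (size_t l = ll; l < lmax; l++) {
                        double blj = b[l + j * k];
                        /* Innermost loop is stride-1 in both a and c. */
                        for (size_t i = ii; i < imax; i++)
                            c[i + j * m] += a[i + l * m] * blj;
                    }
                }
            }
        }
    }
}
```

The inner i loop is the kind of loop that -funroll-loops and the vectorizer
act on. Whether compiler-side blocking such as -floop-nest-optimize gains
anything on top of hand-written tiling like this is exactly the open
question in this comment, and the sketch makes no claim either way.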