http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
--- Comment #3 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2011-11-15 12:19:59 UTC ---
(In reply to comment #1)
> I have a cunning plan.

It is doable to come within a factor of 2 of highly efficient implementations using a cache-oblivious matrix multiply, which is relatively easy to code. I'm not sure this is worth the effort. I believe it would be more important to have actually highly efficient (inlined) implementations for very small matrices. These would outperform general libraries by a large factor.

For CP2K I have written a specialized small matrix multiply library generator which generates code that outperforms e.g. MKL by a large factor for small matrices (<<32x32). The generation time and library size do not make it a general purpose tool. It also contains an implementation of the recursive multiply of some sort (see http://cvs.berlios.de/cgi-bin/viewvc.cgi/cp2k/cp2k/tools/build_libsmm/)
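
For readers unfamiliar with the term, the following is a minimal sketch in C of what a recursive (cache-oblivious style) multiply can look like. It is not the CP2K libsmm code and not anything in libgfortran. The names `matmul_co`, `matmul_base` and the cutoff `CO_BASE` are hypothetical, chosen only for illustration. The routine assumes row-major storage and computes C += A * B by repeatedly halving the largest of the three dimensions until the block is small enough for a plain loop kernel. A strictly cache-oblivious version would recurse down to 1x1; the cutoff here is a practical assumption to limit call overhead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff: blocks with all dimensions <= CO_BASE use the loop kernel. */
#define CO_BASE 16

/* Base case: C += A * B for small row-major blocks.
 * A is m x k, B is k x n, C is m x n; lda, ldb, ldc are row strides. */
static void matmul_base(size_t m, size_t n, size_t k,
                        const double *a, size_t lda,
                        const double *b, size_t ldb,
                        double *c, size_t ldc)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t p = 0; p < k; ++p) {
            const double aip = a[i * lda + p];
            for (size_t j = 0; j < n; ++j)
                c[i * ldc + j] += aip * b[p * ldb + j];
        }
    }
}

/* Recursive C += A * B: split the largest of m, n, k in half and recurse. */
void matmul_co(size_t m, size_t n, size_t k,
               const double *a, size_t lda,
               const double *b, size_t ldb,
               double *c, size_t ldc)
{
    /* Row strides must be at least the row lengths of the current blocks. */
    assert(lda >= k && ldb >= n && ldc >= n);

    if (m <= CO_BASE && n <= CO_BASE && k <= CO_BASE) {
        matmul_base(m, n, k, a, lda, b, ldb, c, ldc);
        return;
    }

    if (m >= n && m >= k) {
        /* Split the rows of A and C. */
        const size_t h = m / 2;
        matmul_co(h, n, k, a, lda, b, ldb, c, ldc);
        matmul_co(m - h, n, k, a + h * lda, lda, b, ldb, c + h * ldc, ldc);
    } else if (n >= k) {
        /* Split the columns of B and C. */
        const size_t h = n / 2;
        matmul_co(m, h, k, a, lda, b, ldb, c, ldc);
        matmul_co(m, n - h, k, a, lda, b + h, ldb, c + h, ldc);
    } else {
        /* Split the inner dimension; both halves accumulate into the same C. */
        const size_t h = k / 2;
        matmul_co(m, n, h, a, lda, b, ldb, c, ldc);
        matmul_co(m, n, k - h, a + h, lda, b + h * ldb, ldb, c, ldc);
    }
}
```

For contiguous row-major arrays, a call would look like `matmul_co(m, n, k, a, k, b, n, c, n)` with C zeroed beforehand if a plain product is wanted. The small-matrix case argued for above is a separate matter: there the gain would come from kernels specialized (and inlined) for fixed sizes, which this generic sketch does not attempt.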