On 11.04.2014 19:05, Sturla Molden wrote:
> Sturla Molden <[email protected]> wrote:
>
>> Making a totally new BLAS might seem like a crazy idea, but it might be the
>> best solution in the long run.
>
> To see if this can be done, I'll try to re-implement cblas_dgemm and then
> benchmark against MKL, Accelerate and OpenBLAS. If I can get the
> performance better than 75% of their speed, without any assembly or dark
> magic, just plain C99 compiled with Intel icc, that would be sufficient for
> binary wheels on Windows I think.
>
Hi,

If you can, also give gcc with Graphite a try. Its loop transformations
should give you results similar to manual blocking, provided the compiler
is able to understand the loop. See
http://gcc.gnu.org/gcc-4.4/changes.html

-floop-strip-mine
-floop-block
-floop-interchange

plus a couple of options to tune the parameters.

You may need gcc-4.8 for it to work properly on loops whose iteration
counts are not fixed at compile time.

As far as I know, clang/LLVM also has a comparable polyhedral loop
optimizer (Polly).

Cheers,
Julian
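
To make the comparison concrete, here is a minimal sketch in plain C99 of the two things being compared: a naive matrix-multiply loop nest of the kind Graphite would have to analyze, and a hand-blocked variant that does the tiling manually. The function names, the row-major layout and the block-size parameter are assumptions made for illustration; this is not cblas_dgemm and not Sturla's implementation. The naive version could be compiled with something like `gcc -O2 -floop-strip-mine -floop-block -floop-interchange`, assuming a gcc build with Graphite support; whether the compiler actually transforms the loop depends on the gcc version and has to be checked by benchmarking.

```c
#include <assert.h>
#include <stddef.h>

/* Naive row-major product C = A * B, with A (n x k), B (k x m), C (n x m).
 * Hypothetical helper: a plain loop nest a loop optimizer can try to tile. */
void matmul_naive(size_t n, size_t m, size_t k,
                  const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < m; j++) {
            double acc = 0.0;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * m + j];
            C[i * m + j] = acc;
        }
    }
}

/* Hand-blocked variant of the same product. bs is the tile edge length
 * (an illustrative parameter; a good value depends on the cache sizes). */
void matmul_blocked(size_t n, size_t m, size_t k, size_t bs,
                    const double *A, const double *B, double *C)
{
    assert(bs > 0);

    /* C is accumulated into, so it has to start at zero. */
    for (size_t i = 0; i < n * m; i++)
        C[i] = 0.0;

    /* Outer loops walk over tiles, inner loops work inside one tile. */
    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = (ii + bs < n) ? ii + bs : n;
        for (size_t pp = 0; pp < k; pp += bs) {
            size_t p_end = (pp + bs < k) ? pp + bs : k;
            for (size_t jj = 0; jj < m; jj += bs) {
                size_t j_end = (jj + bs < m) ? jj + bs : m;
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t p = pp; p < p_end; p++) {
                        double a = A[i * k + p];
                        for (size_t j = jj; j < j_end; j++)
                            C[i * m + j] += a * B[p * m + j];
                    }
                }
            }
        }
    }
}
```

Both functions compute the same product, so the naive one can serve as a correctness reference while timing the blocked one (or the Graphite-compiled naive one) against MKL, Accelerate and OpenBLAS.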
