On 11.04.2014 19:05, Sturla Molden wrote:
> Sturla Molden <[email protected]> wrote:
> 
>> Making a totally new BLAS might seem like a crazy idea, but it might be the
>> best solution in the long run. 
> 
> To see if this can be done, I'll try to re-implement cblas_dgemm and then
> benchmark against MKL, Accelerate and OpenBLAS. If I can get the
> performance better than 75% of their speed, without any assembly or dark
> magic, just plain C99 compiled with Intel icc, that would be sufficient for
> binary wheels on Windows I think.
> 


Hi,
if you can, also give gcc with Graphite a try. Its loop transformations
should give you results similar to manual blocking, provided the compiler
is able to understand the loop; see
http://gcc.gnu.org/gcc-4.4/changes.html
-floop-strip-mine
-floop-block
-floop-interchange
plus a couple of options to tune the parameters

You may need gcc-4.8 for it to work properly on loops whose iteration
counts are not fixed at compile time.
As far as I know, clang/LLVM also has a comparable polyhedral optimizer
(Polly, rather than Graphite itself).
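
To make the comparison concrete, here is a minimal sketch in plain C99 of
the kind of loop nest Graphite is meant to block automatically, next to a
hand-blocked version of the same product. The function names, the square
row-major layout and the tile size BLOCK = 32 are assumptions made for
illustration; this is not cblas_dgemm and it is not tuned. Whether the
flags above actually transform the naive loop depends on the gcc version
and on gcc having been built with Graphite support.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile size, chosen for illustration only (not tuned). */
enum { BLOCK = 32 };

/* Naive C = A * B for n x n row-major matrices.
 * This is the plain loop nest one would hand to the compiler, e.g. with
 *   gcc -O2 -floop-block -floop-interchange -floop-strip-mine
 * (assumes a gcc built with Graphite enabled). */
void matmul_naive(size_t n, const double *restrict a,
                  const double *restrict b, double *restrict c)
{
    assert(n > 0);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* The same product with manual blocking: the i, k and j loops are each
 * strip-mined into tiles of BLOCK, and the loops are ordered i-k-j so
 * that the innermost loop walks rows of B and C contiguously. */
void matmul_blocked(size_t n, const double *restrict a,
                    const double *restrict b, double *restrict c)
{
    assert(n > 0);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t imax = (ii + BLOCK < n) ? ii + BLOCK : n;
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            size_t kmax = (kk + BLOCK < n) ? kk + BLOCK : n;
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t jmax = (jj + BLOCK < n) ? jj + BLOCK : n;
                for (size_t i = ii; i < imax; i++)
                    for (size_t k = kk; k < kmax; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < jmax; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
        }
    }
}

/* Largest absolute element-wise difference between two arrays. */
double max_abs_diff(size_t len, const double *x, const double *y)
{
    double m = 0.0;
    for (size_t i = 0; i < len; i++) {
        double d = x[i] - y[i];
        if (d < 0.0)
            d = -d;
        if (d > m)
            m = d;
    }
    return m;
}
```

Timing matmul_naive built with and without the Graphite flags against
matmul_blocked would show how close the compiler gets to the manual
version on a given machine.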

Cheers,
Julian
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
