Hi,

On Mon, Apr 28, 2014 at 5:50 PM, Nathaniel Smith <[email protected]> wrote:
> On Tue, Apr 29, 2014 at 1:05 AM, Matthew Brett <[email protected]> wrote:
>> Hi,
>>
>> On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith <[email protected]> wrote:
>>> On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn <[email protected]> wrote:
>>>>
>>>> On 11 Apr 2014 at 19:05, Sturla Molden <[email protected]> wrote:
>>>>
>>>>> Sturla Molden <[email protected]> wrote:
>>>>>
>>>>>> Making a totally new BLAS might seem like a crazy idea, but it might be
>>>>>> the best solution in the long run.
>>>>>
>>>>> To see if this can be done, I'll try to re-implement cblas_dgemm and then
>>>>> benchmark against MKL, Accelerate and OpenBLAS. If I can get the
>>>>> performance better than 75% of their speed, without any assembly or dark
>>>>
>>>> So what percentage on performance did you achieve so far?
>>>
>>> I finally read this paper:
>>>
>>> http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf
>>>
>>> and I have to say that I'm no longer so convinced that OpenBLAS is the
>>> right starting point. They make a compelling argument that BLIS *is*
>>> the cleaned up, maintainable, and yet still competitive
>>> reimplementation of GotoBLAS/OpenBLAS that we all want, and that
>>> getting there required a qualitative reorganization of the code (i.e.,
>>> very hard to do incrementally). But they've done it. And, I get the
>>> impression that the stuff they're missing -- threading, cross-platform
>>> build stuff, and runtime CPU adaptation -- is all pretty
>>> straightforward stuff that is only missing because no-one's gotten
>>> around to sitting down and implementing it. (In particular that paper
>>> does include impressive threading results; it sounds like given a
>>> decent thread pool library one could get competitive performance
>>> pretty trivially, it's just that they haven't been bothered yet to do
>>> thread pools properly or systematically test which of the pretty-good
>>> approaches to threading is "best". Which is important if your goal is
>>> to write papers about BLAS libraries but irrelevant to reaching
>>> minimal-viable-product stage.)
>>>
>>> It would be really interesting if someone were to try hacking simple
>>> runtime CPU detection into BLIS and see how far you could get -- right
>>> now they do kernel selection via the C preprocessor, but hacking in
>>> some function pointer thing instead would not be that hard I think. A
>>> maintainable library that builds on Linux/OSX/Windows, gets
>>> competitive performance on last-but-one generation x86-64 CPUs, and
>>> gets better-than-reference-BLAS performance everywhere else, would be
>>> a very very compelling product that I bet would quickly attract the
>>> necessary attention to make it competitive on all CPUs.
>>
>> I wonder - is there anyone who might be able to do this work, if we
>> found funding for a couple of months to do it?
>
> Not much point in worrying about this I think until someone tries a
> proof of concept. But potentially even the labs working on BLIS would
> be interested in a small grant from NumFOCUS or something.
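
For concreteness, the "function pointer thing" mentioned in the quoted text could look roughly like the sketch below. This is only an illustration of the general idea, not BLIS code: the names my_dgemm, gemm_kernel_fn, gemm_kernel_generic and gemm_kernel_avx2 are all made up, the "AVX2" kernel is just a stand-in that forwards to the portable loop, and the detection assumes GCC or Clang on x86, where __builtin_cpu_supports is available. On any other compiler or CPU it falls back to the generic kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature: C += A * B for n x n row-major matrices. */
typedef void (*gemm_kernel_fn)(size_t n, const double *a,
                               const double *b, double *c);

/* Portable fallback kernel: plain triple loop. */
static void gemm_kernel_generic(size_t n, const double *a,
                                const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Stand-in for a tuned AVX2 kernel. A real one would use intrinsics or
 * assembly; here it forwards to the generic loop so the sketch stays short. */
static void gemm_kernel_avx2(size_t n, const double *a,
                             const double *b, double *c)
{
    gemm_kernel_generic(n, a, b, c);
}
#endif

/* Pick a kernel at run time instead of at preprocessing time. */
static gemm_kernel_fn select_gemm_kernel(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return gemm_kernel_avx2;
#endif
    return gemm_kernel_generic;
}

/* Chosen once, on first use. Note: this lazy init is not thread-safe. */
static gemm_kernel_fn active_gemm_kernel = NULL;

/* Hypothetical public entry point that dispatches through the pointer. */
void my_dgemm(size_t n, const double *a, const double *b, double *c)
{
    if (active_gemm_kernel == NULL)
        active_gemm_kernel = select_gemm_kernel();
    assert(active_gemm_kernel != NULL);
    active_gemm_kernel(n, a, b, c);
}
```

A caller would just use my_dgemm(n, a, b, c) and never see which kernel ran. The point of the design is that one binary carries several kernels and picks one per machine, rather than baking the choice in at build time.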

The problem is the time and mental energy involved in the
proof-of-concept may be enough to prevent it being done, and having
some money to pay for time and to placate employers may be useful in
overcoming that.

To be clear - not me - I will certainly help if I can, but being paid
isn't going to help me work on this.

Cheers,

Matthew
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
