On Tue, Apr 29, 2014 at 1:05 AM, Matthew Brett <[email protected]> wrote:
> Hi,
>
> On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith <[email protected]> wrote:
>> On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn <[email protected]>
>> wrote:
>>>
>>> On 11 Apr 2014 at 19:05, Sturla Molden <[email protected]> wrote:
>>>
>>>> Sturla Molden <[email protected]> wrote:
>>>>
>>>>> Making a totally new BLAS might seem like a crazy idea, but it might be
>>>>> the best solution in the long run.
>>>>
>>>> To see if this can be done, I'll try to re-implement cblas_dgemm and then
>>>> benchmark against MKL, Accelerate and OpenBLAS. If I can get the
>>>> performance better than 75% of their speed, without any assembly or dark
>>>
>>> So what percentage on performance did you achieve so far?
>>
>> I finally read this paper:
>>
>> http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf
>>
>> and I have to say that I'm no longer so convinced that OpenBLAS is the
>> right starting point. They make a compelling argument that BLIS *is*
>> the cleaned up, maintainable, and yet still competitive
>> reimplementation of GotoBLAS/OpenBLAS that we all want, and that
>> getting there required a qualitative reorganization of the code (i.e.,
>> very hard to do incrementally). But they've done it. And, I get the
>> impression that the stuff they're missing -- threading, cross-platform
>> build stuff, and runtime CPU adaptation -- is all pretty
>> straightforward stuff that is only missing because no-one's gotten
>> around to sitting down and implementing it. (In particular that paper
>> does include impressive threading results; it sounds like given a
>> decent thread pool library one could get competitive performance
>> pretty trivially, it's just that they haven't been bothered yet to do
>> thread pools properly or systematically test which of the pretty-good
>> approaches to threading is "best". Which is important if your goal is
>> to write papers about BLAS libraries but irrelevant to reaching
>> minimal-viable-product stage.)
>>
>> It would be really interesting if someone were to try hacking simple
>> runtime CPU detection into BLIS and see how far you could get -- right
>> now they do kernel selection via the C preprocessor, but hacking in
>> some function pointer thing instead would not be that hard I think. A
>> maintainable library that builds on Linux/OSX/Windows, gets
>> competitive performance on last-but-one generation x86-64 CPUs, and
>> gets better-than-reference-BLAS performance everywhere else, would be
>> a very very compelling product that I bet would quickly attract the
>> necessary attention to make it competitive on all CPUs.
>
> I wonder - is there anyone who might be able to do this work, if we
> found funding for a couple of months to do it?
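
For anyone wondering what the "function pointer thing" quoted above might look like, here is a rough sketch, not BLIS code. It assumes GCC or Clang on x86 for the `__builtin_cpu_supports` check and falls back to plain C everywhere else. All names (`gemm_kernel_fn`, `sketch_dgemm`, and so on) are made up for illustration, and the "AVX" kernel is only a stand-in that forwards to the generic loop.

```c
#include <stddef.h>

/* Hypothetical kernel signature: C += A * B for row-major n x n matrices. */
typedef void (*gemm_kernel_fn)(size_t n, const double *a,
                               const double *b, double *c);

/* Portable fallback kernel in plain C. */
static void gemm_kernel_generic(size_t n, const double *a,
                                const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Stand-in for a tuned kernel. A real one would use intrinsics or
 * assembly; this one just forwards so the sketch runs anywhere. */
static void gemm_kernel_avx(size_t n, const double *a,
                            const double *b, double *c)
{
    gemm_kernel_generic(n, a, b, c);
}

/* Pick a kernel at run time instead of at preprocessing time. */
static gemm_kernel_fn select_gemm_kernel(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        return gemm_kernel_avx;
#endif
    return gemm_kernel_generic;
}

/* Cached choice. Note: this lazy initialisation is not thread-safe;
 * a real library would select once at load time or use a once-guard. */
static gemm_kernel_fn gemm_kernel = NULL;

/* Hypothetical public entry point that dispatches through the pointer. */
void sketch_dgemm(size_t n, const double *a, const double *b, double *c)
{
    if (gemm_kernel == NULL)
        gemm_kernel = select_gemm_kernel();
    gemm_kernel(n, a, b, c);
}
```

The point of the indirection is that one binary can carry several kernels and choose among them on the machine it actually runs on, at the cost of an indirect call per dispatch.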

Not much point in worrying about this I think until someone tries a
proof of concept. But potentially even the labs working on BLIS would
be interested in a small grant from NumFOCUS or something.

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
