On 27 Aug 2008, at 11:47 am, Vincent Diepeveen wrote:

>> There is an unwritten recruitment rule, certainly in my field of science, that the programmer "must understand the science", and actually being able to write good code is very much a secondary requirement.

> I couldn't disagree more.

Actually, I think we're furiously agreeing with each other.

> Maybe your judgement is not objective.

I did say "in my field of science", which is bioinformatics.


> For any serious software, let's be objective: only a few people ever learn to program really well and find their way around complex codes, and that quite often takes ten years or so to learn. Someone with a PhD or Master's in some science, on the other hand, is usually capable of explaining and understanding things.


> Becoming a very good low-level programmer is a lot harder than learning a few more algorithms that solve a specific problem. Understanding how to program efficiently in parallel is especially hard. I spoke with a guy who last week worked through some code that was used on supercomputers in the 90s, and he concluded it was very inefficient.

Wasn't that exactly my point? That what we need to be hiring are very smart scientists who can design the analyses (check - we've got those), and then a smaller number of very smart programmers who can actually implement them efficiently. And that's precisely what the bioinformatics world hasn't been doing much of up until now; it has just hired the very smart scientists, and not bothered with the programmers.


> A single good low-level programmer can quite often speed things up by a factor of 50.

Indeed. There have been several cases where I've improved scientists' code by a factor of 100 or more. My record so far is seven orders of magnitude, for a widely used piece of code from a certain high-profile bioinformatics centre in the US.

> In his opinion: "getting a speedup of less than a factor of 1000 in scientific number-crunching software is something you can do with your eyes closed".

A slight exaggeration, I think, but if you said a factor of 100 instead, I'd agree completely.

> Knowing everything about efficient caching and hashing, and how to distribute the work over the nodes without paying the full latency or losing a factor of 50+ just to MPI messaging, is simply a full-time expertise in itself. Far fewer people can do that than the huge number of people who can explain the field's material to you.

Agreed.

> Note that bioinformatics is a bad example to mention: it eats a grand total of < 0.5% of system time on supercomputers, and even that system time is hardly used efficiently. There is just not much to calculate there compared with maths, physics and everything to do with the weather, from the climate X years from now to earthquake prediction.

I think here you're showing a little old-time HPC bias. There *is* a lot to calculate in bioinformatics, but what's different about it, compared to the other fields that you mention, is the ratio of computation to data. Most bioinformatics code is very data intensive, and that's why it runs so badly on traditional HPC rigs (and as a result you see so little time devoted to it on supercomputers). It's actually a result of exactly what I initially said in this thread; the computations are out there to do, but bioinformatics sites have not generally been hiring people capable of writing code which runs well on traditional HPC systems.

> Physics by itself eats 50%+ of all supercomputer time.

Most large bioinformatics sites have their own clusters, and don't use the large national facilities. That's part of the reason you don't see them there. Our sites tend not to appear in the Top 500 either, not because they're not actually powerful enough to be in there, but because they're built differently and don't usually run the required benchmarks that well - largely because we don't have low-latency interconnects.

The large quantities of data with, as you say, relatively modest amounts of CPU make it impractical to use national centres, because we just get swamped in the overhead of shipping data into and out of the facility. We have that problem badly enough within our own LAN, let alone shipping stuff out to remote sites. :-) For example, a single sequencing run on a current-technology sequencer is about 2TB of data, which then requires of the order of 24 CPU hours to process. The sequencer then produces another chunk of data the same size two days later. It just wouldn't be practical to shunt that offsite.
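To make the data-movement point concrete, here's a rough back-of-envelope sketch in Python. Only the 2TB run size and ~24 CPU hours come from the paragraph above; the link speeds are illustrative assumptions of mine, not figures from the original exchange.

# Back-of-envelope: time to ship one 2 TB sequencing run offsite
# versus the ~24 CPU hours needed to process it. Link speeds are
# illustrative assumptions, not figures from the original post.
run_size_bytes = 2e12      # one sequencing run, ~2 TB
process_hours = 24         # rough processing time quoted above
for label, gbit_per_s in [("10 Gbit/s", 10.0), ("1 Gbit/s", 1.0), ("100 Mbit/s", 0.1)]:
    transfer_hours = run_size_bytes / (gbit_per_s * 1e9 / 8) / 3600
    print("%10s link: %5.1f h to move the data (vs %d h to process it)"
          % (label, transfer_hours, process_hours))

Even on an (assumed) dedicated 1 Gbit/s path to a remote site, moving a single run takes around 4.5 hours each way, and at more typical shared WAN rates the transfer alone exceeds the compute time - with the next 2TB arriving two days later.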

Tim

