On 27 Aug 2008, at 11:47 am, Vincent Diepeveen wrote:

>> There is an unwritten recruitment rule, certainly in my field of science, that the programmer "must understand the science", and actually being able to write good code is very much a secondary requirement.

> I couldn't disagree more.

Actually, I think we're furiously agreeing with each other.

> Maybe your judgement is not objective.

I did say "in my field of science", which is bioinformatics.


> For any serious software, let's be objective: only a few people ever learn to program really well and find their way around complex codes, and that quite often takes ten years or so to learn. Someone with a PhD or Master's in some science, on the other hand, is usually capable of explaining and understanding things.


> Becoming a very good low-level programmer is a lot harder than learning a few more algorithms that solve a specific problem. Understanding how to program efficiently in parallel is especially hard. I spoke with a guy who last week worked through some code that was used on supercomputers in the 90s, and he concluded it was very inefficient.

Wasn't that exactly my point? That what we need to be hiring are very smart scientists who can design the analyses (check - we've got those), and then a smaller number of very smart programmers who can actually implement them efficiently. And that's precisely what the bioinformatics world hasn't been doing much of up until now; it has just hired the very smart scientists, and not bothered with the programmers.


> A single good low-level programmer can quite often speed things up by a factor of 50.

Indeed. There have been several cases where I've improved scientists' code by a factor of 100 or more. My record so far is seven orders of magnitude, for a widely used piece of code from a certain high-profile bioinformatics centre in the US.

> In his opinion: "getting a speedup of less than a factor of 1000 in scientific number-crunching software is something you can do with your eyes closed".

A slight exaggeration, I think, but if you said a factor of 100 instead, I'd agree completely.

> Knowing everything about efficient caching and hashing, and how to distribute the work over the nodes without paying the full latency or losing a factor of 50+ just to MPI messaging, is simply a full-time expertise in itself. Far fewer people can do that than the huge number of people who can explain the field's material to you.

Agreed.

> Note that bioinformatics is a bad example to mention: it eats a grand total of < 0.5% of system time on supercomputers, and even that system time is hardly used efficiently. There is just not much to calculate there compared with maths, physics and everything to do with the weather, from the climate X years from now to earthquake prediction.

I think here you're showing a little old-time HPC bias. There *is* a lot to calculate in bioinformatics, but what's different about it, compared to the other fields that you mention, is the ratio of computation to data. Most bioinformatics code is very data intensive, and that's why it runs so badly on traditional HPC rigs (and as a result you see so little time devoted to it on supercomputers). It's actually a result of exactly what I initially said in this thread; the computations are out there to do, but bioinformatics sites have not generally been hiring people capable of writing code which runs well on traditional HPC systems.

> Physics by itself eats 50%+ of all supercomputer time.

Most large bioinformatics sites have their own clusters, and don't use the large national facilities. That's part of the reason you don't see them there. Our sites tend not to appear in the Top 500 either, not because they're not actually powerful enough to be in there, but because they're built differently and don't usually run the required benchmarks that well - largely because we don't have low-latency interconnects.

The large quantities of data with, as you say, relatively modest amounts of CPU make it impractical to use national centres, because we just get swamped in the overhead of shipping data into and out of the facility. We have that problem badly enough within our own LAN, let alone shipping stuff out to remote sites. :-) For example, a single sequencing run on a current-technology sequencer is about 2TB of data, which then requires of the order of 24 CPU hours to process. The sequencer then produces another chunk of data the same size two days later. It just wouldn't be practical to shunt that offsite.
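To make the data-movement point concrete, here's a rough back-of-envelope sketch in Python. Only the 2TB run size and ~24 CPU hours come from the paragraph above; the link speeds are illustrative assumptions of mine, not figures from the original exchange.

# Back-of-envelope: time to ship one 2 TB sequencing run offsite
# versus the ~24 CPU hours needed to process it. Link speeds are
# illustrative assumptions, not figures from the original post.
run_size_bytes = 2e12      # one sequencing run, ~2 TB
process_hours = 24         # rough processing time quoted above
for label, gbit_per_s in [("10 Gbit/s", 10.0), ("1 Gbit/s", 1.0), ("100 Mbit/s", 0.1)]:
    transfer_hours = run_size_bytes / (gbit_per_s * 1e9 / 8) / 3600
    print("%10s link: %5.1f h to move the data (vs %d h to process it)"
          % (label, transfer_hours, process_hours))

Even on an (assumed) dedicated 1 Gbit/s path to a remote site, moving a single run takes around 4.5 hours each way, and at more typical shared WAN rates the transfer alone exceeds the compute time - with the next 2TB arriving two days later.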

Tim

