Re: [Beowulf] Rant on why HPC isn't as easy as I'd like it to be. [EXT]

Tim Cutts Tue, 21 Sep 2021 05:02:30 -0700

I think that’s exactly the situation we’ve been in for a long time, especially 
in life sciences, and it’s becoming more entrenched.  My experience is that the 
average user of our scientific computing systems has been becoming less 
technically savvy for many years now.


The presence of the cloud makes that more acute, in particular because it makes 
it easy for the user to effectively throw more hardware at the problem, which 
reduces the incentive to make their code particularly fast or efficient.  Cost 
is the only brake on it, and in many cases I’m finding the PI doesn’t actually 
care about that.  They care that a result is being obtained (and it’s time to 
first result they care about, not time to complete all the analysis), and so 
they typically don’t have much time for those of us who are telling them they 
need to invest time up front developing and optimising efficient code.

And cost is not necessarily the brake I thought it was going to be anyway.  One 
recent project we’ve done on AWS has impressed me a great deal.  It’s not 
terribly CPU efficient, and would doubtless, with sufficient effort, run much 
more efficiently on premises.  But it’s extremely elastic in its nature, and so 
a good fit for the cloud.  Once a week, the project has to completely 
re-analyse the 600,000+ COVID genomes we’ve sequenced so far, looking for new 
branches in the phylogenetic tree, and complete that analysis inside 8 
hours.  Initial attempts to naively convert the HPC implementation to run on 
AWS looked as though they were going to be very expensive (~$50k per weekly 
run).  But a fundamental reworking of the entire workflow to make it as cloud 
native as possible, by which I mean almost exclusively serverless, has 
succeeded beyond what I expected.  The total cost is <$5,000 a month, and 
because there is essentially no statically configured infrastructure at all, 
the security is fairly easy to be comfortable about.  And all of that was done 
with no detailed thinking about whether the actual algorithms running in the 
containers are at all optimised in a traditional HPC sense.  It’s just not 
needed for this particular piece of work.  Did it need software developers with 
hardcore knowledge of performance optimisation?  No.  Was it rapid to develop 
and deploy?  Yes.  Is the performance fast enough for UK national COVID variant 
surveillance?  Yes.  Is it cost effective?  Yes.  Sold!  The one thing it did 
need was knowledgeable cloud architects, but the cloud providers can and do 
help with that.
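
As a rough sanity check on those figures, here is a back-of-the-envelope sketch.  The two dollar amounts are the ones quoted above; annualising over a 52-week year and treating "<$5,000 a month" as an upper bound are assumptions made purely for illustration.

```python
# Back-of-the-envelope comparison using only the figures quoted above.
# Assumption: costs are annualised over 52 weeks / 12 months.
WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12

naive_cost_per_weekly_run = 50_000   # "~$50k per weekly run" (naive HPC port)
serverless_cost_per_month = 5_000    # upper bound from "<$5,000 a month"

naive_annual = naive_cost_per_weekly_run * WEEKS_PER_YEAR
serverless_annual = serverless_cost_per_month * MONTHS_PER_YEAR
ratio = naive_annual / serverless_annual

print(f"naive port:  ~${naive_annual:,} per year")
print(f"serverless:  <${serverless_annual:,} per year")
print(f"reduction:   at least ~{ratio:.0f}x")
```

Because the serverless figure is an upper bound, the real reduction would be somewhat larger than this estimate.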

Tim

--
Tim Cutts
Head of Scientific Computing
Wellcome Sanger Institute


On 21 Sep 2021, at 12:24, John Hearns <hear...@gmail.com> wrote:

Some points well made here. I have seen in the past job scripts passed on from 
graduate student to graduate student - the case I am thinking of was an Abaqus 
script for 8 core systems, being run on a new 32 core system. Why WOULD a 
graduate student question a script given to them - which works? They should be 
getting on with their science. I guess this is where Research Software 
Engineers come in.

Another point I would make is about modern processor architectures, for 
instance AMD Rome/Milan. You can have different NUMA Per Socket options, which 
affect performance. We set the preferred IO path - which I have seen myself to 
have an effect on latency of MPI messages. If you are not concerned about your 
hardware layout you would just go ahead and run, missing a lot of performance.

I am now going to be controversial and comment that over in Julia land the 
pattern seems to be these days people develop on their own laptops, or maybe 
local GPU systems. There is a lot of microbenchmarking going on. But there 
seems to be not a lot of thought given to CPU pinning or what happens with 
hyperthreading. I guess topics like that are part of HPC 'Black Magic' - though 
I would imagine the low latency crowd are hot on them.
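
For anyone who has not tried pinning, a minimal sketch of what it looks like from inside a process is below. It assumes Linux, where Python exposes `os.sched_getaffinity` and `os.sched_setaffinity`; the helper name `pin_to_cpus` is hypothetical, and picking the lowest-numbered allowed CPU is an arbitrary choice for illustration, not a tuning recommendation.

```python
import os


def pin_to_cpus(cpus):
    """Pin the calling process to the given CPU ids (hypothetical helper, Linux only)."""
    allowed = os.sched_getaffinity(0)      # CPUs this process may currently use
    target = set(cpus) & allowed           # never request a CPU we are not allowed on
    if not target:
        raise ValueError(f"none of {sorted(cpus)} are in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, target)        # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


if hasattr(os, "sched_setaffinity"):
    # Arbitrary illustrative choice: the lowest-numbered CPU we are allowed to use.
    first_cpu = min(os.sched_getaffinity(0))
    pinned = pin_to_cpus([first_cpu])
    print(f"now restricted to CPUs: {sorted(pinned)}")
else:
    pinned = None
    print("CPU affinity calls are not available on this platform")
```

Which CPU ids share a physical core (the hyperthreading question) depends on the machine, so the ids to pin to are best chosen after looking at the system layout rather than guessed.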

I often introduce people to the excellent lstopo/hwloc utilities which show the 
layout of a system. Most people are pleasantly surprised to find this.





