On 4 Feb 2021, at 10:40, Jonathan Aquilina <jaquil...@eagleeyet.net> wrote:

Maybe SETI@home wasn't the right project to mention; I just remembered there is 
another project on that sort of distributed platform, though not in genomics: 
Folding@home.

Right, protein dynamics simulations like that are at the other end of the 
data/compute ratio spectrum.  Very suitable for distributed computing in that 
sort of way.

So with genomics you cannot break it down into smaller chunks, where the data 
can be crunched and returned to sender, then processed once the data is back or 
as it's being received?

It depends on what you're doing.  If you already know the reference genome 
then yes, you can.  We already do this to some extent; the reads from the
sequencing run are de-multiplexed first, and then the reads for each sample are 
processed as a separate embarrassingly parallel job.  This is basically doing a 
jigsaw puzzle when you know the picture.
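
To make that concrete, here's a minimal sketch of the embarrassingly parallel 
pattern in Python.  The demux and align_sample functions are hypothetical 
stand-ins for the real pipeline steps, not our actual code:

    # De-multiplex reads by sample barcode, then process each sample as an
    # independent parallel job.  Toy stand-in for the real pipeline.
    from collections import defaultdict
    from multiprocessing import Pool

    def demux(reads):
        """Group reads by sample barcode (assumed here: first 8 bases)."""
        by_sample = defaultdict(list)
        for read in reads:
            by_sample[read[:8]].append(read)
        return by_sample

    def align_sample(item):
        sample, reads = item
        # ... align this sample's reads against the reference ...
        return sample, len(reads)          # placeholder result

    def run(reads):
        with Pool() as pool:               # one independent job per sample
            return dict(pool.map(align_sample, demux(reads).items()))

    if __name__ == "__main__":             # guard needed for multiprocessing
        print(run(["AAAAAAAACGT", "AAAAAAAATTT", "CCCCCCCCGGG"]))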

Read alignment to a reference (if you already have a standard reference 
genome) is easily decomposable, as much as you like, right down to a single 
read in the extreme case.  But the compute for a single read is tiny (it's 
basically a fuzzy grep), and you'd be swamped in scheduling overhead.  For 
maximum throughput we don't bother distributing it further, but use 
multithreading on a single node.
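
To see why the per-read compute is so small, here's a toy "fuzzy grep" in 
Python.  It scans for positions where a read matches the reference with a few 
mismatches allowed; real aligners index the genome (e.g. with an FM-index) 
rather than scanning it like this, so treat it purely as an illustration:

    # Naive "fuzzy grep" for one read: report every offset where the read
    # matches the reference with at most max_mismatches differences.
    def fuzzy_find(reference, read, max_mismatches=2):
        hits = []
        for i in range(len(reference) - len(read) + 1):
            window = reference[i:i + len(read)]
            mismatches = sum(a != b for a, b in zip(window, read))
            if mismatches <= max_mismatches:
                hits.append(i)
        return hits

    print(fuzzy_find("ACGTACGTTAGC", "ACGA", max_mismatches=1))  # -> [0, 4]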

There have been some interesting distributed mapping attempts, for example 
decomposing the problem into read groups small enough to fit in the time limit 
of an AWS Lambda function.  You get fabulous turnaround time on the analysis if 
you do that, but you use about four times as much actual compute time as the 
single-node, multi-threaded approach we currently use (reference to the Lambda 
work: https://www.biorxiv.org/content/10.1101/576199v1.full.pdf).  As usual, it 
all depends on what you're optimising for: cost, throughput, or turnaround time.
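
As a back-of-envelope sketch of that decomposition, assuming a made-up mean 
alignment cost per read (the 15-minute figure is the current Lambda cap; the 
other numbers are purely illustrative):

    # Pick a read-group size that should finish inside the Lambda time limit.
    LAMBDA_LIMIT_S = 900      # AWS Lambda's current cap: 15 minutes
    PER_READ_S = 0.001        # assumed mean alignment cost per read (made up)
    SAFETY = 0.5              # only budget half the window, to be safe

    def chunk_reads(reads):
        per_chunk = int(LAMBDA_LIMIT_S * SAFETY / PER_READ_S)  # 450,000 reads
        for i in range(0, len(reads), per_chunk):
            yield reads[i:i + per_chunk]   # each chunk -> one Lambda invocation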

For some of our projects (Darwin Tree of Life being the prime example), you 
don't know what the reference genome looks like.  The problem is still fuzzy 
grep, but now you're comparing the reads against each other and looking for 
overlaps, rather than comparing them all independently against the reference.  
You're doing the jigsaw puzzle without knowing the picture.  That's a bit 
harder to distribute, and most approaches currently cop out and do it all on 
single large-memory machines.  One way to make this easier is to make the reads 
longer (i.e. make the puzzle pieces larger and fewer in number), which is what 
sequencing technologies like Oxford Nanopore and PacBio Sequel try to do.  But 
their throughput is not as high as that of the short-read Illumina approach.
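
The core of that all-against-all overlap step, as naive Python: find where a 
suffix of one read matches a prefix of another.  The quadratic pair loop is 
exactly what makes this awkward to shard across machines; real assemblers use 
k-mer indexes or suffix structures instead of brute force:

    # Longest suffix of a that is also a prefix of b (at least min_len long).
    def overlap(a, b, min_len=3):
        for ln in range(min(len(a), len(b)), min_len - 1, -1):
            if a[-ln:] == b[:ln]:
                return ln
        return 0

    # All-vs-all overlap graph: O(n^2) read pairs, the hard part to distribute.
    def overlap_graph(reads, min_len=3):
        return {(i, j): olap
                for i, a in enumerate(reads)
                for j, b in enumerate(reads)
                if i != j and (olap := overlap(a, b, min_len))}

    print(overlap_graph(["ACGTTG", "TTGCAA", "CAAACG"]))
    # -> {(0, 1): 3, (1, 2): 3, (2, 0): 3}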

Some people have taken distributed approaches though (JGI's MetaHipMer for 
example: https://www.nature.com/articles/s41598-020-67416-5).  That's tackling 
an even nastier problem: sequencing many genomes at the same time, for example 
gut flora from a stool sample, and not only doing de novo assembly as in the 
last example, but trying to do so when you don't know how many different 
genomes you have in the sample.  So now you have multiple jigsaw puzzles mixed 
up in the same box, and you don't know any of the pictures.  And of course you 
have multiple strains, so some of those puzzles have the same picture but 1% of 
the pieces are different, and you need to work out which is which.
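
One common trick for the mixed-puzzles case (not what MetaHipMer itself does; 
its distributed k-mer analysis is far more sophisticated) is to bin sequences 
by k-mer composition before assembling, since different genomes tend to have 
distinct compositional signatures.  A toy sketch:

    # Bin a sequence by dinucleotide composition: assign it to whichever
    # reference bin has the closest frequency vector.  Purely illustrative.
    from collections import Counter
    from itertools import product

    KMERS = ["".join(p) for p in product("ACGT", repeat=2)]  # 16 dinucleotides

    def signature(seq):
        counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
        total = sum(counts.values()) or 1
        return [counts[k] / total for k in KMERS]

    def nearest_bin(seq, bin_signatures):
        sig = signature(seq)
        return min(bin_signatures,
                   key=lambda b: sum((s - t) ** 2
                                     for s, t in zip(sig, bin_signatures[b])))

    bins = {"genomeA": signature("ACACACACAC"), "genomeB": signature("GGTTGGTTGG")}
    print(nearest_bin("ACACAC", bins))     # -> genomeA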

Fun fun fun!

Tim





-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.