On 4 Feb 2021, at 10:40, Jonathan Aquilina <jaquil...@eagleeyet.net> wrote:
> Maybe SETI@home wasn't the right project to mention; I just remembered
> there is another project, though not in genomics, on that distributed
> platform: Folding@home.
Right, protein dynamics simulations like that are at the other end of the
data/compute ratio spectrum, and very well suited to that sort of
distributed computing.
> So with genomics you cannot break it down into smaller chunks where the
> data can be crunched then returned to sender, and then processed once the
> data is back or as it's being received?
It depends on what you're doing. If you already know the reference genome
then yes, you can. We already do this to some extent: the reads from the
sequencing run are de-multiplexed first, and then the reads for each sample
are processed as a separate, embarrassingly parallel job. This is basically
doing a jigsaw puzzle when you already know the picture.
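To make that concrete, here's a minimal sketch of the demultiplex-then-fan-out
pattern in Python. The barcode map, the read records, and process_sample()
are all made-up placeholders, not our actual pipeline:

# Minimal sketch of demultiplex-then-fan-out.  The barcode map, read
# records and process_sample() are hypothetical placeholders.
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

def demultiplex(reads, barcode_to_sample):
    """Bucket reads by the sample their barcode belongs to."""
    buckets = defaultdict(list)
    for read in reads:
        sample = barcode_to_sample.get(read["barcode"])
        if sample is not None:          # drop unrecognised barcodes
            buckets[sample].append(read["seq"])
    return buckets

def process_sample(sample, seqs):
    """Placeholder for the real per-sample analysis (alignment, QC, ...)."""
    return sample, len(seqs)

if __name__ == "__main__":
    reads = [
        {"barcode": "ACGT", "seq": "TTAGGCTTAG"},
        {"barcode": "ACGT", "seq": "GGCATTACGA"},
        {"barcode": "TGCA", "seq": "CCGTAACCGT"},
    ]
    buckets = demultiplex(reads, {"ACGT": "sample1", "TGCA": "sample2"})
    # Each sample is now a separate, embarrassingly parallel job.
    with ProcessPoolExecutor() as pool:
        for sample, n in pool.map(process_sample, buckets.keys(),
                                  buckets.values()):
            print(sample, n, "reads")

Once the reads are bucketed per sample, each bucket is a completely
independent job, which is why this stage parallelises so well.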
Read alignment to a reference (if you already have a standard reference
genome) is easily decomposable, as much as you like, right down to a single
read in the extreme case. But the compute for a single read is tiny (it's
basically fuzzy grep), and you'd be swamped by scheduling overhead. For
maximum throughput we don't bother distributing it further, but use
multithreading on a single node.
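For a feel of why the per-read compute is so small, here's a toy version of
the fuzzy grep, nothing like a real aligner such as bwa or minimap2, just a
sliding mismatch count:

def best_hit(read, reference):
    """Toy 'fuzzy grep': slide the read along the reference and return
    (position, mismatches) for the best-matching offset.  Real aligners
    use indexed, banded algorithms; this just shows how little work one
    read costs."""
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(a != b for a, b in zip(read, window))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm

print(best_hit("ACGTT", "GGACGATACGTTC"))   # -> (7, 0), a perfect match

Each call is microseconds of work, so the cost of scheduling one distributed
task per read would dwarf the work itself; batching reads and using threads
on a single node wins.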
There have been some interesting distributed mapping attempts, for example
decomposing the problem into read groups small enough to fit in the time
limit of an AWS Lambda function. You get fabulous turnaround time on the
analysis if you do that, but you use about four times as much actual compute
time as the single-node, multi-threaded approach we currently use (reference
to the Lambda work: https://www.biorxiv.org/content/10.1101/576199v1.full.pdf).
As usual, it all depends on what you're optimising for: cost, throughput, or
turnaround time?
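The chunking arithmetic behind that kind of Lambda decomposition looks
roughly like this; the per-read cost and safety factor are illustrative
assumptions of mine, not numbers from the paper:

# Back-of-envelope chunking for a Lambda-style time limit.  The per-read
# cost and safety factor are illustrative assumptions, not figures from
# the paper linked above.
LAMBDA_LIMIT_S = 15 * 60          # hard wall-clock ceiling per invocation
PER_READ_COST_S = 0.0005          # assumed average alignment cost per read
SAFETY = 0.5                      # headroom for startup, download, upload

reads_per_chunk = int(LAMBDA_LIMIT_S * SAFETY / PER_READ_COST_S)

def chunk(reads, size):
    """Split the read list into groups small enough for one invocation."""
    for i in range(0, len(reads), size):
        yield reads[i:i + size]

reads = ["read%d" % i for i in range(3_000_000)]
jobs = list(chunk(reads, reads_per_chunk))
print(len(jobs), "invocations of up to", reads_per_chunk, "reads each")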
For some of our projects (Darwin Tree of Life being the prime example), you
don't know what the reference genome looks like. The problem is still fuzzy
grep, but now you're comparing the reads against each other and looking for
overlaps, rather than comparing them all independently against the reference.
You're doing the jigsaw puzzle without knowing the picture. That's a bit
harder to distribute, and most current approaches cop out and do it all in
single large-memory machines. One way to make this easier is to make the
reads longer (i.e. make the puzzle pieces larger and fewer), which is what
sequencing technologies like Oxford Nanopore and PacBio Sequel try to do.
But their throughput is not as high as the short-read Illumina approach.
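Here's a toy illustration of the overlap search at the heart of de novo
assembly. Because every read has to be compared against every other (naively
O(n^2)), there's no natural way to carve the work into independent chunks the
way alignment-to-reference allows:

from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b
    (at least min_len long), or 0 if there is none."""
    start = len(a) - min(len(a), len(b))
    for i in range(start, len(a) - min_len + 1):
        if b.startswith(a[i:]):
            return len(a) - i
    return 0

reads = ["ACGTTGCA", "TGCATTT", "TTTGGA"]
for a, b in permutations(reads, 2):     # all-against-all comparison
    olen = overlap(a, b)
    if olen:
        print(a, "->", b, "overlap", olen)

Real assemblers use k-mer indexes and de Bruijn graphs rather than this brute
force, but the all-against-all flavour of the problem is the same, and it's
what makes the big shared-memory machine so tempting.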
Some people have taken distributed approaches, though (JGI's MetaHipMer, for
example: https://www.nature.com/articles/s41598-020-67416-5). That's tackling
an even nastier problem: sequencing many genomes at the same time, for
example gut flora from a stool sample, and not only doing de novo assembly as
above, but doing it when you don't know how many different genomes you have
in the sample. So now you have multiple jigsaw puzzles mixed up in the same
box, and you don't know any of the pictures. And of course you have multiple
strains, so some of those puzzles have the same picture but 1% of the pieces
are different, and you need to work out which is which.
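One common trick for the mixed-puzzles part is to group sequences by k-mer
composition before assembling; many metagenome binners use signals like this
(usually on assembled contigs rather than raw reads). To be clear, this is
not how MetaHipMer itself works, and it doesn't help with the 1%-different
strains. A crude sketch:

from collections import Counter

def kmer_profile(seq, k=2):
    """Normalised k-mer frequency vector: a crude 'which puzzle does
    this piece belong to' signal."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def distance(p, q):
    """L1 distance between two k-mer profiles."""
    keys = set(p) | set(q)
    return sum(abs(p.get(kk, 0.0) - q.get(kk, 0.0)) for kk in keys)

a = kmer_profile("ATATATATATGCATAT")   # AT-rich "genome"
b = kmer_profile("GCGCGGCGCCGCGGCG")   # GC-rich "genome"
read = kmer_profile("ATATATGG")
print("closer to a" if distance(read, a) < distance(read, b)
      else "closer to b")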
Fun fun fun!
Tim
--
The Wellcome Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.