On the subject of Lambda functions: I think I already flagged that AWS Lambda now supports container images of up to 10 GB as the deployment package https://aws.amazon.com/blogs/aws/new-for-aws-lambda-container-image-support/
which makes a Julia-language Lambda function possible:
https://www.youtube.com/watch?v=6DvpneWRb_w

On Thu, 4 Feb 2021 at 11:49, Tim Cutts <t...@sanger.ac.uk> wrote:

> On 4 Feb 2021, at 10:40, Jonathan Aquilina <jaquil...@eagleeyet.net>
> wrote:
>
>> Maybe SETI@home wasn't the right project to mention; I just remembered
>> there is another project, not in genomics, on that distributed platform,
>> called Folding@home.
>
> Right, protein dynamics simulations like that are at the other end of the
> data/compute ratio spectrum. Very suitable for distributed computing in
> that sort of way.
>
>> So with genomics you cannot break it down into smaller chunks where the
>> data can be crunched then returned to sender and then processed once the
>> data is back or as it's being received?
>
> It depends on what you’re doing. If you already know the reference genome
> then, yes, you can. We already do this to some extent; the reads from the
> sequencing run are de-multiplexed first, and then the reads for each sample
> are processed as a separate embarrassingly parallel job. This is basically
> doing a jigsaw puzzle when you know the picture.
>
> The read alignment to reference (if you already have a standard reference
> genome) is easily decomposable as much as you like, right down to a single
> read in the extreme case, but the compute for a single read is tiny (this
> is basically fuzzy grep going on here), and you’d be swamped in scheduling
> overhead. For maximum throughput we don’t bother distributing it further,
> but use multithreading on a single node.
>
> There have been some interesting distributed mapping attempts, for example
> decomposing the problem into read groups small enough to fit in the time
> limit of an AWS Lambda function. You get fabulous turnaround time on the
> analysis if you do that, but you use about four times as much actual
> compute time as the single-node, multi-threaded approach we currently use
> (reference to the Lambda work:
> https://www.biorxiv.org/content/10.1101/576199v1.full.pdf). As usual, it
> all depends on what you’re optimising for: cost, throughput, or turnaround
> time?
>
> For some of our projects (Darwin Tree of Life being the prime example),
> you don’t know what the reference genome looks like. The problem is still
> fuzzy grep, but now you’re comparing the reads against each other and
> looking for overlaps, rather than comparing them all independently against
> the reference. You’re doing the jigsaw puzzle without knowing the
> picture. That’s a bit harder to distribute, and most approaches currently
> cop out and do it all in single large-memory machines. One way to make
> this easier is to make the reads longer (i.e. make the puzzle pieces larger
> and fewer of them), which is what sequencing technologies like Oxford
> Nanopore and PacBio Sequel try to do. But their throughput is not as high
> as the short-read Illumina approach.
>
> Some people have taken distributed approaches though (JGI’s MetaHipMer for
> example: https://www.nature.com/articles/s41598-020-67416-5). That’s
> tackling an even nastier problem: sequencing many genomes at the same
> time, for example gut flora from a stool sample, and not only doing
> *de novo* assembly as in the last example, but trying to do so when
> you don’t know how many different genomes you have in the sample. So now
> you have multiple jigsaw puzzles mixed up in the same box, and you don’t
> know any of the pictures. And of course you have multiple strains, so some
> of those puzzles have the same picture but 1% of the pieces are different,
> and you need to work out which is which.
>
> Fun fun fun!
>
> Tim
>
> --
> The Wellcome Sanger Institute is operated by Genome Research Limited, a
> charity registered in England with number 1021457 and a company registered
> in England with number 2742969, whose registered office is 215 Euston Road,
> London, NW1 2BE.
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
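
For anyone following along, here is a toy sketch of the reference-based decomposition described in the quoted message: each read is matched against a known reference independently ("fuzzy grep"), so the reads can be split into groups of any size and handed to separate workers. This is only an illustration under stated assumptions, not how a production aligner works. Hamming distance over every window stands in for real approximate matching, a thread pool stands in for separate nodes or Lambda invocations, and all names (`align_read`, `align_all`, `chunk_size`, `max_mismatches`) are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, reference, max_mismatches=1):
    """Brute-force 'fuzzy grep': best (position, mismatches) or None.

    Assumption: substitutions only, no insertions or deletions.
    Ties are resolved in favour of the leftmost position.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[pos:pos + len(read)])
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (pos, d)
    return best


def chunked(items, size):
    """Yield consecutive groups of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def align_chunk(chunk, reference):
    """Align one group of reads; each group is an independent job."""
    return [align_read(read, reference) for read in chunk]


def align_all(reads, reference, chunk_size=2, workers=2):
    """Split reads into groups and align the groups concurrently.

    Results come back in the same order as the input reads.
    """
    job = partial(align_chunk, reference=reference)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = pool.map(job, chunked(reads, chunk_size))
        return [hit for hits in per_chunk for hit in hits]


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"
    reads = ["ACGT", "GTTA", "TAGC", "ACGA", "GGGG"]
    for read, hit in zip(reads, align_all(reads, reference)):
        print(read, hit)
```

Setting `chunk_size=1` is the "single read" extreme from the quoted message: the result is identical, but every read becomes its own job, which is where scheduling overhead would swamp the tiny amount of compute per read.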