> On 3 Feb 2021, at 18:23, Jörg Saßmannshausen <sassy-w...@sassy.formativ.net> 
> wrote:
> 
> Hi John,
> 
> interesting stuff and good reading. 
> 
> For the IT interests on here: these sequencing machines are chucking out large 
> amounts of data per day. The project I am involved in can chew out 400 GB or 
> so of raw data per day. That is a small machine. That then needs to be 
> processed before you can actually analyze it. So there is quite some data 
> movement etc involved here. 


If anyone wants any details, just ask me, since the IT supporting all that 
sequencing is my team’s baby.

Actually, the sequencing capacity needed for this volume of COVID samples is 
not large.  The virus genome is so small (only 30,000 bases, compared to a 
human’s 3 billion base pairs) that you can massively multiplex the samples in 
a single sequencing run.

Currently, we multiplex 384 samples per NovaSeq sequencing lane.  There are 
four lanes per flowcell, and two flowcells per sequencer.  The sequencing run 
takes about 24 hours, so each instrument can sequence about 3,000 samples per 
day.
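
As a back-of-the-envelope check on those figures, here is a minimal sketch. 
The names are made up for illustration, and it assumes every lane is fully 
loaded and runs go back to back, which is an idealisation:

```python
# Per-instrument throughput, using only the figures quoted above.
# All names here are illustrative, not part of any real pipeline.
SAMPLES_PER_LANE = 384        # current multiplexing level
LANES_PER_FLOWCELL = 4
FLOWCELLS_PER_SEQUENCER = 2
RUN_HOURS = 24.0              # "about 24 hours", so only approximate


def samples_per_run() -> int:
    """Samples sequenced in one run of a single instrument."""
    return SAMPLES_PER_LANE * LANES_PER_FLOWCELL * FLOWCELLS_PER_SEQUENCER


def samples_per_day(run_hours: float = RUN_HOURS) -> float:
    """Samples per 24 hours, assuming runs are started back to back."""
    return samples_per_run() * 24.0 / run_hours


if __name__ == "__main__":
    print(samples_per_run())    # 384 * 4 * 2 = 3072
    print(samples_per_day())    # 3072.0 for a 24-hour run
```

That is where the "about 3,000 samples per day" comes from.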

We have about 20 of these sequencers, so our total capacity is very high; in 
fact we only use three sequencers for COVID at the moment, because sample and 
library preparation (getting those 384 samples ready for the sequencer) is 
actually the bottleneck.  We are planning to increase throughput though, both 
by increasing multiplexing and by using more sequencers.
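
To put rough numbers on the two levers (multiplexing level and number of 
sequencers), here is a small sketch.  The function name and parameters are 
hypothetical, and the result is only an upper bound on sequencing throughput 
that ignores the sample-preparation bottleneck:

```python
# Rough ceiling on daily sequencing capacity for a set of instruments.
# Hypothetical helper; assumes 24-hour runs, 4 lanes per flowcell and
# 2 flowcells per sequencer, with every lane fully loaded.
def fleet_samples_per_day(n_sequencers: int, samples_per_lane: int = 384) -> int:
    lanes_per_sequencer = 4 * 2
    return n_sequencers * lanes_per_sequencer * samples_per_lane


if __name__ == "__main__":
    print(fleet_samples_per_day(3))    # three sequencers on COVID: 9216
    print(fleet_samples_per_day(20))   # all ~20 sequencers: 61440
```

Raising either argument scales the ceiling linearly, but what we actually 
achieve depends on how fast samples and libraries can be prepared.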

Sequencing itself is a bit less than a day, and the computational analysis to 
de-multiplex and reconstruct the genomes is less than a day running on our 
production-oriented OpenStack cluster (we keep critical projects like Heron on 
a physically separate cluster from normal faculty research); we can easily keep 
up with the sequencers.  We then upload our results to the folks at CLIMB, and 
that’s where the comparative genomics tends to take place.

There’s a lot of effort at the moment going into speeding up the end-to-end 
process; for this sequencing to be as useful as possible for close-to-real-time 
outbreak and mutation analysis, the turnaround time needs to be as short as 
possible.  It turns out you can see statistically significant new mutation 
signatures very early on before infection rates really start to rise (this was 
visible in Kent data for B.1.1.7), so the sooner we can see this sort of thing 
the better we will get at taking appropriate measures.

For more details on the actual analysis, we released a public seminar a couple 
of weeks ago:

https://stream.venue-av.com/e/sanger_seminars/Barrett

Tim




-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
