It looks fairly accurate. The reason is that reconciling distributed checkpoints is theoretically difficult. Consequently, frequent checkpointing is cost-prohibitive for exascale apps.
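To put a rough number on "cost-prohibitive", here is a back-of-the-envelope sketch using Young's first-order approximation for the checkpoint interval, sqrt(2 * C * MTBF). The node MTBF, node count, and checkpoint write time are made-up values for illustration, not measurements from any real machine, and the function names are just for this example. The approximation also assumes the checkpoint cost is small compared with the system MTBF, which barely holds at these numbers, so treat the result as indicative only.

```python
import math

def system_mtbf(node_mtbf_s, n_nodes):
    # Assumes independent, exponentially distributed node failures,
    # so the system MTBF shrinks linearly with the node count.
    return node_mtbf_s / n_nodes

def young_interval(checkpoint_cost_s, mtbf_s):
    # Young's first-order approximation of the compute time
    # between checkpoints that minimizes lost work.
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

def checkpoint_overhead(checkpoint_cost_s, interval_s):
    # Fraction of wall-clock time spent writing checkpoints
    # (ignores rework after a failure and restart time).
    return checkpoint_cost_s / (interval_s + checkpoint_cost_s)

# Hypothetical inputs, chosen only to illustrate the scaling.
node_mtbf_s = 5 * 365 * 24 * 3600   # assume 5 years per node
n_nodes = 100_000                   # assumed node count
checkpoint_cost_s = 600.0           # assume 10 minutes to dump full memory

mtbf = system_mtbf(node_mtbf_s, n_nodes)
interval = young_interval(checkpoint_cost_s, mtbf)
overhead = checkpoint_overhead(checkpoint_cost_s, interval)

print(f"system MTBF:         {mtbf / 60:.1f} min")
print(f"checkpoint interval: {interval / 60:.1f} min")
print(f"time spent writing:  {overhead:.0%}")
```

With these assumed inputs, roughly 30% of the wall-clock time goes to writing checkpoints, before counting any recomputation after a failure.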
Justin

On Fri, Sep 21, 2012 at 10:49 AM, Hearns, John <john.hea...@mclaren.com> wrote:
> http://www.theregister.co.uk/2012/09/21/emc_abba/
>
> Frequent checkpointing will of course be vital for exascale, given the MTBF
> of individual nodes.
>
> However how accurate is this statement:
>
> HPC jobs involving half a million compute cores ... have a series of
> checkpoints set up in their code with the entire memory state stored at each
> checkpoint in a storage node.
>
> John Hearns | CFD Hardware Specialist | McLaren Racing Limited
> McLaren Technology Centre, Chertsey Road, Woking, Surrey GU21 4YH, UK
>
> T: +44 (0) 1483 262000
> D: +44 (0) 1483 262352
> F: +44 (0) 1483 261928
> E: john.hea...@mclaren.com
> W: www.mclaren.com
>
> The contents of this email are confidential and for the exclusive use of the
> intended recipient. If you receive this email in error you should not copy
> it, retransmit it, use it or disclose its contents but should return it to
> the sender immediately and delete your copy.
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf