On Fri, Sep 21, 2012 at 02:49:32PM +0000, Hearns, John wrote:
> http://www.theregister.co.uk/2012/09/21/emc_abba/
>
> Frequent checkpointing will of course be vital for exascale, given the
> MTBF of individual nodes.
Individual nodes have very good MTBF. It's /system/ scale that causes
problems for system MTBF. Take a look at Christian Engelmann's
presentation at
http://www.csm.ornl.gov/~engelman/publications/engelmann10resilience.ppt.pdf

Our primary approach today is recovery-based resilience, a.k.a.
checkpoint-restart (C/R). I'm not convinced we can continue to rely on
that at exascale. Having written that, we can clearly improve on C/R
overheads with various techniques, including NVM. A number of papers
have discussed the use of NVM to reduce overheads so that we can
continue to rely on C/R. See these, for example:

http://dl.acm.org/citation.cfm?id=1654117
http://dl.acm.org/citation.cfm?id=1845215

> However, how accurate is this statement:
>
> HPC jobs involving half a million compute cores ... have a series of
> checkpoints set up in their code with the entire memory state stored
> at each checkpoint in a storage node.

We're not concerned about the "entire memory state". Application-level
checkpointing only saves an application-dependent portion of the
program's data. Granted, this could still be a /large/ fraction of
system memory. Storing the checkpoint in persistent storage, but *not*
"a storage node", is one current approach. Storing in other nodes'
memory, e.g., diskless checkpointing, is another. Rough sketches of
both follow below my signature.

--
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.
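P.S. To make "application-level checkpointing" concrete, here is a
minimal sketch in C. The array names, sizes, and file path are
illustrative assumptions, not taken from any real code:

#include <stdio.h>

#define N 1000000              /* illustrative problem size */

static double temperature[N];  /* solver state that must survive a restart */
static double pressure[N];     /* solver state that must survive a restart */
static long   step;            /* iteration count at checkpoint time       */

/* Write only the application-relevant data, not the full memory image. */
int checkpoint(const char *path)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    if (fwrite(&step, sizeof step, 1, f) != 1 ||
        fwrite(temperature, sizeof(double), N, f) != N ||
        fwrite(pressure, sizeof(double), N, f) != N) {
        fclose(f);
        return -1;
    }
    return fclose(f);          /* 0 on success */
}

/* Read the same data back to resume at the saved iteration. */
int restart(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    if (fread(&step, sizeof step, 1, f) != 1 ||
        fread(temperature, sizeof(double), N, f) != N ||
        fread(pressure, sizeof(double), N, f) != N) {
        fclose(f);
        return -1;
    }
    return fclose(f);
}

int main(void)
{
    step = 42;
    if (checkpoint("ckpt.bin") != 0 || restart("ckpt.bin") != 0) {
        perror("checkpoint/restart");
        return 1;
    }
    printf("resuming at step %ld\n", step);
    return 0;
}

The point is what's *not* saved: the heap, the stack, and library
state. The application knows which few arrays define its state, which
is why the checkpoint can be much smaller than the full memory image.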
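And a similarly hypothetical sketch of the diskless idea: checkpoint
into a partner node's memory over MPI. Production schemes typically
parity-encode across a group rather than fully replicate, but a buddy
pair shows the mechanism. The pairing, the names, and the even
rank-count assumption are mine:

#include <mpi.h>

#define N 1000000

static double state[N];        /* this rank's solver state           */
static double buddy_copy[N];   /* in-memory checkpoint of the buddy  */

void diskless_checkpoint(MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Pair ranks (0<->1, 2<->3, ...); assumes an even number of ranks. */
    int buddy = (rank % 2 == 0) ? rank + 1 : rank - 1;

    /* Swap state with the buddy; each side now holds both copies, so a
     * failed node can be rebuilt from its partner's memory without ever
     * touching the file system. */
    MPI_Sendrecv(state,      N, MPI_DOUBLE, buddy, 0,
                 buddy_copy, N, MPI_DOUBLE, buddy, 0,
                 comm, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    diskless_checkpoint(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}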