I would suggest that some scheme of redundant computation might be more effective. Storing each node's state on the node and then, if any node hiccups, restoring that state (perhaps to a spare) and restarting means stopping the entire cluster while you recover.
Or, if you can factor your computation to make use of extra processing nodes, you can just keep on moving. Think of this as a higher-level scheme than, say, Hamming codes for memory protection: use 12 bits to store 8, and you're still synchronous.

Assuming your algorithm can self-detect an error, you could just use 2N nodes, and take only correct outputs from node I and/or node I+1 to feed to node M (and M+1). This has been done at a lower level for some specialized algorithms (e.g. FFT), where there are tricks for knowing whether an arithmetic error occurred.

Or you could go the brute-force triple/vote route, but that has its share of problems (the voter has to be very reliable).

Yes, it will require clever algorithm design (of comparable cleverness to the design of the original Hamming codes, but more complex), particularly to find a way to do it generically rather than per problem. But once that is figured out, we'll really be able to make progress, because transient (or permanent) failures won't slow down the computation.

Checkpointing is a fairly crude approach to fault tolerance, after all.

On 9/21/12 8:15 AM, "Justin YUAN SHI" <s...@temple.edu> wrote:

>It looks fairly accurate.
>
>This is because reconciling distributed checkpoints is theoretically
>difficult. Therefore, frequent checkpointing is cost prohibitive for
>exascale apps.
>
>Justin
>
>On Fri, Sep 21, 2012 at 10:49 AM, Hearns, John <john.hea...@mclaren.com>
>wrote:
>> http://www.theregister.co.uk/2012/09/21/emc_abba/
>>
>> Frequent checkpointing will of course be vital for exascale, given the
>> MTBF of individual nodes.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
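A footnote on the 2N-node idea in the reply at the top of this message, for anyone who wants something concrete. What follows is only a toy sketch, not a real cluster scheme: the "nodes" are plain function calls, the stage is an integer matrix-vector product, and the self-check is a simple checksum test (the sum of the outputs must equal the column sums of the matrix dotted with the input). The names `self_checked_matvec`, `redundant_stage`, and the `faults` injection hook are all made up for illustration.

```python
from typing import Callable, List, Optional, Sequence

Matrix = List[List[int]]
Vector = List[int]
Fault = Optional[Callable[[Vector], Vector]]  # hypothetical fault injector


def matvec(a: Matrix, x: Vector) -> Vector:
    """Plain integer matrix-vector product y = A x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def self_checked_matvec(a: Matrix, x: Vector, fault: Fault = None) -> Optional[Vector]:
    """Compute A x on one 'node'; return None if the checksum test fails.

    The test relies on sum(A x) == (column sums of A) . x, which costs
    O(n) extra work on top of the O(n^2) product.
    """
    y = matvec(a, x)
    if fault is not None:
        y = fault(y)  # simulate a transient arithmetic error (assumption)
    col_sums = [sum(col) for col in zip(*a)]
    expected = sum(c * xj for c, xj in zip(col_sums, x))
    return y if sum(y) == expected else None


def redundant_stage(a: Matrix, x: Vector,
                    faults: Sequence[Fault] = (None, None)) -> Vector:
    """Run the stage on two replicas (node I, node I+1) and forward the
    first output that passes its own self-check, with no rollback."""
    for fault in faults:
        y = self_checked_matvec(a, x, fault)
        if y is not None:
            return y
    raise RuntimeError("both replicas failed their self-check")


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    x = [5, 6]
    bump = lambda y: [y[0] + 1] + y[1:]  # corrupt the first output element
    # Replica I is corrupted and rejected; replica I+1 supplies the result.
    print(redundant_stage(a, x, faults=(bump, None)))  # [17, 39]
```

Two caveats on the sketch. A single checksum misses errors that cancel out (say, +1 in one element and -1 in another), and it assumes the check itself runs correctly on the faulty node. Real schemes of this kind need stronger checks, and that is part of the "clever algorithm design" problem.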