Re: [Beowulf] Checkpointing using flash

Lux, Jim (337C) Fri, 21 Sep 2012 09:13:38 -0700

I would suggest that some scheme of redundant computation might be more
effective.  Storing a single node's state on the node and then, if any
node hiccups, restoring that state (perhaps to a spare) and restarting
means stopping the entire cluster while you recover.

Or, if you can factor your computation to make use of extra processing
nodes, you can just keep on moving.  Think of this as a higher level
scheme than, say, Hamming codes for memory protection:  use 12 bits to
store 8, and you're still synchronous.
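
To make the memory analogy concrete, here is a minimal Python sketch of a
single-error-correcting Hamming code for one byte.  For 8 data bits the
standard construction needs 4 check bits, so the codeword is 12 bits long.
The function names and the bit layout (check bits at positions 1, 2, 4 and
8) are choices made for this illustration only.

```python
CODE_LEN = 12
# Positions that are not powers of two carry data: 3, 5, 6, 7, 9, 10, 11, 12.
DATA_POSITIONS = [p for p in range(1, CODE_LEN + 1) if p & (p - 1)]


def hamming_encode(data):
    """Encode an 8-bit value as a 12-bit codeword (list index 0 is unused)."""
    bits = [0] * (CODE_LEN + 1)
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (data >> i) & 1
    # Check bit at position c covers every position whose index has bit c set.
    for check in (1, 2, 4, 8):
        parity = 0
        for pos in range(1, CODE_LEN + 1):
            if pos & check and pos != check:
                parity ^= bits[pos]
        bits[check] = parity
    return bits


def hamming_decode(bits):
    """Return (data, syndrome); a nonzero syndrome is the position that was fixed."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, CODE_LEN + 1):
        if bits[pos]:
            syndrome ^= pos
    if syndrome > CODE_LEN:
        raise ValueError("more than one bit is wrong; cannot correct")
    if syndrome:
        bits[syndrome] ^= 1  # flip the single bad bit back
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        data |= bits[pos] << i
    return data, syndrome
```

The reader of the word never stops or rolls back: a flipped bit is
repaired on the way out, which is the property one would like to have at
the level of whole nodes.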

Assuming your algorithm has the ability to self-detect an error, you could
just use 2N nodes, and only take correct outputs from node I and/or node
I+1 to feed to node M (and M+1).
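
A toy Python sketch of that 2N arrangement, under the assumption that
every node can report whether its own result passed a self-check.  The
"nodes" here are plain functions, and the stage operations (integer square
root, then squaring) are hypothetical stand-ins for real work.

```python
import math


def self_checking(compute, check):
    """Wrap a computation so the node returns (result, passed_self_check)."""
    def node(x):
        y = compute(x)
        return y, check(x, y)
    return node


def isqrt_check(x, r):
    # r is the integer square root of x exactly when r^2 <= x < (r+1)^2.
    return r * r <= x < (r + 1) * (r + 1)


def square_check(x, y):
    # y is x squared (x >= 0) exactly when y is a perfect square with root x.
    return math.isqrt(y) == x and x * x == y


good_isqrt = self_checking(math.isqrt, isqrt_check)
faulty_isqrt = self_checking(lambda x: math.isqrt(x) + 1, isqrt_check)
good_square = self_checking(lambda x: x * x, square_check)


def run_stage(replicas, x):
    """Feed x to the replica pair and forward the first self-verified output."""
    for replica in replicas:
        value, ok = replica(x)
        if ok:
            return value
    raise RuntimeError("every replica of this stage failed its self-check")


def run_pipeline(stages, x):
    for replicas in stages:
        x = run_stage(replicas, x)
    return x
```

With stages `[[faulty_isqrt, good_isqrt], [good_square, good_square]]`,
the bad output of the first replica is simply ignored and the pipeline
carries on without a restart.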

This has been done for some specialized algorithms at a lower level (e.g.
FFT) where there are some tricks to know if there was an arithmetic error.
Or, you could go the brute force triple/vote route, but that has its share
of problems (the voter has to be very reliable).
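
One well-known example of such a trick for a Fourier transform is
Parseval's identity: the signal energy in the time domain must equal the
energy in the frequency domain divided by N.  The Python sketch below uses
a naive DFT rather than a real FFT library, and the tolerance is an
arbitrary assumption.  It is offered as an illustration of the idea, not
as the specific method used in the published fault-tolerant FFT work.

```python
import cmath


def dft(x):
    """Naive O(N^2) discrete Fourier transform of a sequence of numbers."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def parseval_ok(x, spectrum, rel_tol=1e-9):
    """Check sum |x|^2 == (1/N) * sum |X|^2 to within a relative tolerance."""
    time_energy = sum(abs(v) ** 2 for v in x)
    freq_energy = sum(abs(v) ** 2 for v in spectrum) / len(x)
    return abs(time_energy - freq_energy) <= rel_tol * max(time_energy, 1.0)
```

The check costs O(N), much less than the transform itself, but it only
catches errors that change the total energy; an error that alters only
the phase of a bin would pass unnoticed.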

Yes, it will require clever algorithm design (of a comparable cleverness
to the design of the original Hamming codes, but more complex),
particularly to find a way to do it generically that is not problem
specific.  But when that is figured out, then we'll really be able to make
progress, because transient (or permanent) failures won't slow down the
computation.

Checkpointing is a fairly crude approach to fault tolerance, after all.

On 9/21/12 8:15 AM, "Justin YUAN SHI" <s...@temple.edu> wrote:

>It looks fairly accurate.
>
>This is because reconciling distributed checkpoints is theoretically
>difficult. Therefore, frequent checkpointing is cost prohibitive for
>exascale apps.
>
>Justin
>
>On Fri, Sep 21, 2012 at 10:49 AM, Hearns, John <john.hea...@mclaren.com>
>wrote:
>> http://www.theregister.co.uk/2012/09/21/emc_abba/
>>
>>
>>
>> Frequent checkpointing will of course be vital for exascale, given the
>> MTBF of individual nodes.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
