Or, if you can factor your computation to make use of extra processing nodes, you can just keep on moving. Think of this as a higher-level scheme than, say, Hamming codes for memory protection: use 12 bits to store 8, and you're still synchronous.
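For anyone who wants the memory-protection analogy spelled out, here is a minimal sketch of a single-error-correcting Hamming code over one byte. It is an illustration only, not how any particular memory controller does it. The names `encode` and `decode` and the bit layout are my own assumptions, and by the usual Hamming bound 8 data bits need 4 check bits, so the codeword here is 12 bits long.

```python
# Sketch of a Hamming(12,8) single-error-correcting code.
# Positions are numbered 1..12; positions 1, 2, 4 and 8 hold parity bits.
# Function names and layout are illustrative assumptions, not a real API.

PARITY_POS = (1, 2, 4, 8)
DATA_POS = tuple(p for p in range(1, 13) if p not in PARITY_POS)


def encode(byte):
    """Return a 12-bit codeword as a list of bits; index 0 is unused."""
    code = [0] * 13
    for i, pos in enumerate(DATA_POS):
        code[pos] = (byte >> i) & 1
    for p in PARITY_POS:
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[k] for k in range(1, 13) if k & p and k != p) % 2
    return code


def decode(code):
    """Return (byte, syndrome); syndrome is the corrected position or 0."""
    code = list(code)
    syndrome = 0
    for p in PARITY_POS:
        if sum(code[k] for k in range(1, 13) if k & p) % 2:
            syndrome += p
    if syndrome > 12:
        # Cannot be caused by a single flipped bit in a 12-bit codeword.
        raise ValueError("uncorrectable error")
    if syndrome:
        code[syndrome] ^= 1  # flip the faulty bit back
    byte = sum(code[pos] << i for i, pos in enumerate(DATA_POS))
    return byte, syndrome


if __name__ == "__main__":
    word = encode(0xA5)
    word[6] ^= 1  # simulate a single-bit upset
    value, fixed_at = decode(word)
    print(hex(value), fixed_at)
```

The point of the analogy: the reader of the byte never stalls, it just pays for the extra bits. Spare processing nodes would play the role of the check bits one level up.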
Jim, you are smarter than me! I was going to air the idea of pairs of nodes running in lock-step, with either node able to STONITH the other if there is a machine check event or if the other node does not keep up with reporting results. The survivor then signals to the cluster management: "There's been a failure here, but let's keep trucking to the end of the run, when you can come along and replace my buddy and me." The obvious drawback is that you get half an exaflop for your money!

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf