On 25.09.2012 at 12:19, Andrew Holway wrote:

> <snip>
> I'm pretty sure faulty hardware is the root cause of our fault
> tolerance problems :). In any case, the main issue seems to be the loss
> of a chunk of your application memory when the node fails, not so much
> the retransmission of messages. MPI has some kind of functionality
> inside to address fault tolerance anyway.

If you are interested: there was a lot of discussion about fault tolerance (FT) in MPI-3. There is a mailing list:

http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

-- Reuti