I think the Redundant Memory paper was really misconfigured. It applies a storage solution to a volatile memory problem while insisting on eliminating volatility. It looks very much messed up.
My earlier comment on the OSI model still stands, even though the MPI implementation sits so far down the stack that it may not fit the OSI model well. The MPI implementation, even at the transport layer, does NOT re-transmit messages. As you know, there are semantic differences between an MPI message and a packet: reliable packet transmission does not equal reliable message transmission. When a machine hangs while running the MPI protocol stack, the entire app hangs. This is therefore the root cause of all our fault tolerance problems, and it seems hard to fix. It is caused by the design of MPI's direct messaging interface (except for group communication). The current group communication protocol implementation still does not handle the issue.

Justin

On Mon, Sep 24, 2012 at 4:52 AM, Andrew Holway <andrew.hol...@gmail.com> wrote:
>> I made a sketch :) http://bit.ly/TlkHpH
>
> Really? scheduled downtime? on a monday morning?
>
> new link :) http://bit.ly/RbpKW8
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
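
To make the packet-versus-message point concrete, here is a minimal toy sketch in Python. It is not MPI and uses no MPI calls; the names `worker` and `send_with_failover` and the `timeout` parameter are hypothetical, chosen only for illustration. The in-process queue stands in for a reliable transport: delivery always succeeds, yet a peer that hangs never replies, so only a message-level timeout and re-send keeps the sender from hanging too.

```python
import queue
import threading


def worker(inbox, outbox, hang):
    """Toy peer: replies to one message, or never replies if it 'hangs'."""
    msg = inbox.get()
    if hang:
        return  # simulated hung node: the message arrived, but no reply follows
    outbox.put(("done", msg))


def send_with_failover(msg, peers, timeout=0.2):
    """Send msg to each peer in turn until one replies within `timeout`.

    `peers` is a list of booleans; True marks a peer that hangs.
    Returns (index of the peer that answered, its reply).
    """
    for i, hang in enumerate(peers):
        inbox, outbox = queue.Queue(), queue.Queue()
        threading.Thread(
            target=worker, args=(inbox, outbox, hang), daemon=True
        ).start()
        inbox.put(msg)  # "transport" delivery succeeds even if the peer hangs
        try:
            return i, outbox.get(timeout=timeout)
        except queue.Empty:
            continue  # message-level timeout: re-send the message to the next peer
    raise RuntimeError("no peer answered")
```

Calling `send_with_failover("x", [True, False])` waits out the first (hung) peer and gets its answer from the second. Without the timeout, the plain blocking `outbox.get()` would wait forever, which is the whole-app hang described above.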