I think the Redundant Memory paper is badly misconfigured. It uses
a storage solution to attack a volatile memory problem, yet insists
on eliminating volatility. It looks very muddled to me.

My earlier comment on the OSI model still stands, even though the MPI
implementation sits so far down the stack that it may not fit the OSI
model well. The MPI implementation, even at the transport layer, does NOT
re-transmit messages.

As you know, there are semantic differences between an MPI message and
a packet. Reliable packet transmission does not equal reliable
message transmission. When a machine running the MPI protocol stack
hangs, the entire app hangs. This is the root cause of all our
fault tolerance problems.
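
To make the packet vs. message distinction concrete, here is a rough
Python sketch. It is not MPI code and does not use any MPI API; the
names FlakyPeer and send_reliable are made up for illustration, and the
"network" is just an in-process queue. The point is only that
retransmission is decided at the message level, with a timeout, so the
sender gets an error instead of hanging when the peer never answers.

```python
import queue


class FlakyPeer:
    """Hypothetical peer that silently loses its first `drops` messages,
    standing in for a node that is hung or unreachable."""

    def __init__(self, drops):
        self.drops = drops
        self.acks = queue.Queue()   # acks travelling back to the sender
        self.delivered = []         # messages the application actually saw

    def deliver(self, msg_id, payload):
        if self.drops > 0:
            self.drops -= 1         # message lost: no ack is ever produced
            return
        if all(seen_id != msg_id for seen_id, _ in self.delivered):
            self.delivered.append((msg_id, payload))  # ignore duplicates
        self.acks.put(msg_id)


def send_reliable(peer, msg_id, payload, retries=3, timeout=0.05):
    """Resend the whole message until it is acknowledged or retries run out.
    Returns the number of attempts used."""
    for attempt in range(1, retries + 1):
        peer.deliver(msg_id, payload)
        try:
            if peer.acks.get(timeout=timeout) == msg_id:
                return attempt
        except queue.Empty:
            continue                # timed out: retry instead of blocking forever
    raise TimeoutError(f"message {msg_id} not acknowledged after {retries} attempts")
```

With FlakyPeer(drops=2) the third attempt gets through; with a peer
that never answers, the caller gets a TimeoutError it can act on rather
than an application-wide hang.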

It also seems hard to fix. The problem comes from MPI's direct
messaging interface design (group communication aside), and the
current group communication protocol implementation still does not
handle it.

Justin

On Mon, Sep 24, 2012 at 4:52 AM, Andrew Holway <andrew.hol...@gmail.com> wrote:
>> I made a sketch :) http://bit.ly/TlkHpH
>
> Really? scheduled downtime? on a monday morning?
>
> new link :) http://bit.ly/RbpKW8
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf