Regardless of how low the MPI stack goes, it has never "punched" through the packet-retransmission layer. The OSI model therefore still serves as a useful template to frame this discussion.
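The mismatch under discussion -- the transport layer retransmits each packet, while an MPI-style application assumes every transmission succeeds -- can be sketched with a toy model. This is a hypothetical illustration, not a real MPI API: `unreliable_send`, `send_fail_stop`, and `send_with_retry` are invented names, and the injected `faults` iterator stands in for transient errors that would not recur on retry.

```python
def unreliable_send(payload, faults):
    """Toy stand-in for a transmission above the transport layer.

    `faults` is an iterator of booleans; True simulates a transient
    fault (one that would NOT recur on a retry). Hypothetical code,
    not a real MPI call.
    """
    if next(faults, False):
        raise IOError("transient fault above the transport layer")
    return len(payload)

def send_fail_stop(payload, faults):
    # The MPI-style assumption: one attempt, and any transient fault
    # halts the application (which is why checkpoints exist at all).
    return unreliable_send(payload, faults)

def send_with_retry(payload, faults, max_attempts=5):
    # Temporal redundancy: retransmit at the application layer,
    # mirroring what OSI layers 1-4 already do per packet.
    for _ in range(max_attempts):
        try:
            return unreliable_send(payload, faults)
        except IOError:
            pass
    raise IOError("gave up after %d attempts" % max_attempts)
```

With two injected transient faults, `send_fail_stop` dies on the first attempt while `send_with_retry` completes on the third -- the same data, the same faults, different reliability assumptions.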
Justin

On Sep 22, 2012, at 10:34 AM, "Lux, Jim (337C)" <james.p....@jpl.nasa.gov> wrote:

> I see MPI as sitting much lower (network or transport, perhaps)
>
> Maybe for this (as in many other cases) the OSI model is not an
> appropriate one. That is, most practical systems have more blending
> between layers, and outright punching through. There are a variety of
> high-level protocols/algorithms that can make effective use of current
> (and predicted) low-level behavior to optimize the high-level behavior.
>
> In any case, OSI is more of a conceptualization tool when looking at a
> system.
>
> But I agree that transient faults (whether a failure that succeeds on
> retry (temporal redundancy) or a failure that prompts use of a
> redundant unit (spatial redundancy)) are what you need to deal with.
>
> There is a huge literature on this, for all sorts of scenarios. The
> Byzantine Generals problem is a classic example. Synchronization of
> time in a system is another.
>
> The challenge is in devising generalized software approaches that can
> effectively make use of redundancy (in whatever form). By effective, I
> mean that the computation runs in the same amount of time (or consumes
> the same amount of some other resource) regardless of the occurrence
> of some number of failures. From an information theory standpoint, I
> think that means you MUST have redundancy, and the trick is efficient
> use of that redundancy.
>
> For communications channels, we are at the point where coding can get
> you within hundredths of a dB of the Shannon limit.
>
> For algorithms, not so much. We've got good lossless compression
> algorithms, which is a start. You remove redundancy from the input
> data stream, reducing the user data rate, and you can make use of the
> "extra" bandwidth to do effective coding to mitigate errors in the
> channel.
> However, while this compress/error-correcting-code/decompress scheme
> is more reliable/efficient, it does have longer latency (Shannon just
> sets the limit, and assumes you have infinite memory on both ends of
> the link).
>
> So in a computational scenario, that latency might be a real problem.
>
> On 9/22/12 3:42 AM, "Justin YUAN SHI" <s...@temple.edu> wrote:
>
>> Ellis:
>>
>> If we go to a little nitty-gritty detail view, you will see that
>> transient faults are the ultimate enemies of exascale computing. The
>> problem, if we really go to the nitty-gritty details, stems from a
>> mismatch between the MPI assumptions and what the OSI model promises.
>>
>> To be exact, the OSI layers 1-4 can defend against packet losses and
>> corruption caused by transient hardware and network failures. Layers
>> 5-7 provide no protection. MPI sits on top of layer 7, and it assumes
>> that every transmission must be successful (this is why we have to
>> use checkpoints in the first place) -- a reliability assumption that
>> the OSI model has never promised.
>>
>> In other words, any transient fault while processing the code in
>> layers 5-7 (and MPI calls) can halt the entire app.
>>
>> Justin
>>
>> On Fri, Sep 21, 2012 at 12:29 PM, Ellis H. Wilson III <el...@cse.psu.edu> wrote:
>>> On 09/21/12 12:13, Lux, Jim (337C) wrote:
>>>> I would suggest that some scheme of redundant computation might be
>>>> more effective. Rather than trying to store a single node's state
>>>> on the node, and then, if any node hiccups, restoring the state
>>>> (perhaps to a spare) and restarting -- that means stopping the
>>>> entire cluster while you recover.
>>>
>>> I am not 100% sure about the nitty-gritty here, but I do believe
>>> there are schemes already in place to deal with single node failures.
>>> What I do know for sure is that checkpoints are used as a last line
>>> of defense against full cluster failure due to overheating, power
>>> failure, or excessive numbers of concurrent failures -- not for just
>>> one node going belly up.
>>>
>>> The LANL clusters I was learning about only checkpointed every 4-6
>>> hours or so, if I remember correctly. With hundred-petascale
>>> clusters and beyond hitting failure rates on the frequency of not
>>> even hours but minutes, obviously checkpointing is not the go-to
>>> first attempt at failure recovery.
>>>
>>> If I find some of the nitty-gritty I'm currently forgetting about
>>> how smaller, isolated failures are handled now, I'll report back.
>>>
>>> Nevertheless, great ideas Jim!
>>>
>>> Best,
>>>
>>> ellis
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
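The tension Ellis describes -- checkpoints every few hours versus failure rates measured in minutes -- can be made concrete with Young's first-order approximation of the optimal checkpoint interval, sqrt(2 * C * MTBF), where C is the time to write one checkpoint. A back-of-the-envelope sketch; the numeric inputs below are illustrative assumptions, not figures from the thread.

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the optimal interval
    between checkpoints: sqrt(2 * C * MTBF), all times in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Assumed inputs for illustration: a 10-minute checkpoint write
# and a 24-hour system MTBF.
comfortable = young_interval(600, 24 * 3600)   # roughly 10,180 s, ~2.8 hours

# If MTBF drops to 10 minutes, the "optimal" interval collapses to
# roughly 850 s (~14 minutes) -- barely more than the 600 s it takes
# to write the checkpoint, so the machine would spend most of its
# time checkpointing rather than computing.
collapsed = young_interval(600, 600)
```

This is why checkpoint/restart alone stops scaling once MTBF approaches the checkpoint cost, and why the thread turns to redundancy-based schemes instead.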