On Thu, Nov 22, 2012 at 11:19:51PM -0500, Justin YUAN SHI wrote:

> The fundamental problem rests in our programming API. If you look at
> MPI and OpenMP carefully, you will find that these and all others have
> one common assumption: the application-level communication is always
> successful.
Justin,

You keep saying this, but it's simply not true. MPI implementations typically retry until success, and if the communication network has a failure that can be fixed by retrying or by hot-swapping hardware, the application need not fail. It's _node_ failure that's the problem, not "application-level communication" failure.

I've met a lot of people working on adding fault tolerance to scientific computing over the past decade and a half, and none of them have been this unclear when describing the basic issue. Your description of your proposed solution is also super unclear.

-- greg

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf