On Thu, Nov 22, 2012 at 11:19:51PM -0500, Justin YUAN SHI wrote:

> The fundamental problem rests in our programming API. If you look at
> MPI and OpenMP carefully, you will find that these and all others have
> one common assumption: the application-level communication is always
> successful.
Justin,

You keep saying this, but it's simply not true. MPI implementations typically retry until success, and if the communication network has a failure that can be fixed by retrying or by hot-swapping hardware, the application need not fail. It's _node_ failure that's the problem, not "application-level communication" failure.

I've met a lot of people working on adding fault tolerance to scientific computing over the past decade and a half, and none of them have been this unclear when describing the basic issue. Your description of your proposed solution is also super unclear.

-- greg

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf