This is interesting stuff.
Think back a few years to when we were talking about checkpoint/restart issues:
as the scale of your problem gets bigger, the time spent checkpointing becomes
bigger than the time spent actually doing useful work.
And, of course, the reason we do checkpoint/restart is that it’s bare-metal 
and easy.  Just like simple message passing is “close to the metal” and 
“straightforward”.
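
As a rough illustration of that scaling argument, here is a minimal sketch using Young's first-order approximation for the checkpoint interval. It assumes independent node failures (so system MTBF is node MTBF divided by node count) and ignores restart time. The function name and all the numbers are hypothetical, chosen only to show the trend.

```python
import math

def checkpoint_efficiency(nodes, node_mtbf_hours, checkpoint_hours):
    """Approximate fraction of wall time left for useful work.

    Assumption: failures are independent, so the system MTBF shrinks
    as node_mtbf_hours / nodes. Restart cost is ignored.
    """
    system_mtbf = node_mtbf_hours / nodes
    # Young's first-order estimate of the best interval between checkpoints.
    interval = math.sqrt(2.0 * checkpoint_hours * system_mtbf)
    # First-order waste: time writing checkpoints, plus the expected rework
    # after a failure (about half an interval per MTBF).
    waste = checkpoint_hours / interval + interval / (2.0 * system_mtbf)
    return max(0.0, 1.0 - waste)

if __name__ == "__main__":
    # Hypothetical machine: 50,000 h node MTBF, 15 minute checkpoint.
    for n in (1_000, 10_000, 100_000):
        eff = checkpoint_efficiency(n, 50_000.0, 0.25)
        print(f"{n:>7} nodes: useful fraction ~ {eff:.2f}")
```

With the checkpoint time held fixed, the useful fraction in this model falls as the node count rises, and once the system MTBF is of the same order as the checkpoint time essentially nothing is left for real work.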

Similarly, there’s “fine grained” error detection and correction: ECC in 
memory; redundant comm links or retries.  Each of them imposes some 
speed/performance penalty (it takes some non-zero time to compute the syndrome 
bits in an ECC, and some non-zero time to fix the errored bits… in a lot of 
systems these days, that might be buried in a pipeline, but the delay is there, 
and affects performance).
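
To make the "compute the syndrome, then fix the bit" steps concrete, here is a minimal sketch using a Hamming(7,4) code. That code is picked only because it is small; real memory ECC uses wider codes, and the function names here are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword.

    Codeword positions 1..7 hold p1, p2, d1, p3, d2, d3, d4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(codeword):
    """Return (data_bits, syndrome) after fixing at most one flipped bit."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the bad bit, or 0 if clean.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # the "fix the errored bit" step
    return [c[2], c[4], c[5], c[6]], syndrome

if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # flip the bit at position 5
    print(hamming74_correct(word))
```

Note that the syndrome is computed on every read, whether or not anything is wrong, which is the evenly spread cost; the correction step only runs when the syndrome is non-zero.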

I think of ECC as a sort of diffuse fault management: it’s pervasive, uniform, 
and the performance penalty is applied evenly through the system.  Redundant 
(in the TMR sense) links are the same way.

Retries are a bit different.  Detecting a fault is diffuse and pervasive 
(e.g. CRC checks occur on each message), but the correction of the fault is 
discrete and consumes resources at that time.  In a system with tight time 
coupling (a pipelined systolic array would be the sort of worst case), many 
nodes have to wait to fix the one that failed.
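
A minimal sketch of that split between always-on detection and occasional correction, assuming a made-up link that sometimes flips one bit. The names `frame`, `check`, `noisy_link` and `send_with_retry` are hypothetical; only `zlib.crc32` is a real library call.

```python
import random
import zlib

def frame(payload: bytes) -> bytes:
    """Append a CRC-32 to the payload (paid on every message)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check(framed: bytes):
    """Return the payload if the CRC matches, else None."""
    payload, crc = framed[:-4], int.from_bytes(framed[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None

def noisy_link(framed: bytes, rng: random.Random, flip_probability: float) -> bytes:
    """Hypothetical channel: sometimes flips a single bit in transit."""
    if rng.random() < flip_probability:
        data = bytearray(framed)
        data[rng.randrange(len(data))] ^= 1 << rng.randrange(8)
        return bytes(data)
    return framed

def send_with_retry(payload, rng, flip_probability=0.3, max_tries=10):
    """Resend until the CRC checks out; returns (payload, attempts used)."""
    for attempt in range(1, max_tries + 1):
        received = check(noisy_link(frame(payload), rng, flip_probability))
        if received is not None:
            return received, attempt
    raise RuntimeError("link gave up after max_tries attempts")

if __name__ == "__main__":
    rng = random.Random(0)
    print(send_with_retry(b"halo exchange", rng))
```

The CRC cost shows up on every call, but the extra trips through the loop only happen on the messages that were actually damaged, and anything waiting on that message waits with it.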

A lot depends on the application: tighter time coupling is worse than 
embarrassingly parallel (which is what a lot of the “big data” stuff is: 
fundamentally EP, scatter the requests, run in parallel, gather the results).
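
A back-of-the-envelope sketch of why coupling matters, under some loud assumptions: each node independently needs a retry on any given step with probability `p_retry`, a retry costs a fixed amount of time, and in the lockstep case every node waits whenever any one node retries. The function names and numbers are hypothetical.

```python
def lockstep_stall_probability(nodes, p_retry):
    """Chance that at least one of the nodes retries on a given step."""
    return 1.0 - (1.0 - p_retry) ** nodes

def expected_step_time(nodes, p_retry, step=1.0, retry_cost=1.0, lockstep=True):
    """Mean time per step seen by a node.

    Lockstep: everyone pays whenever anyone retries.
    Uncoupled (EP-like): a node only pays for its own retries. The final
    gather still waits for the slowest worker, which this ignores.
    """
    p = lockstep_stall_probability(nodes, p_retry) if lockstep else p_retry
    return step + p * retry_cost

if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        tight = expected_step_time(n, 0.001, lockstep=True)
        loose = expected_step_time(n, 0.001, lockstep=False)
        print(f"{n:>7} nodes: lockstep {tight:.3f}, uncoupled {loose:.3f}")
```

In this toy model the uncoupled per-node cost does not depend on the node count at all, while the lockstep cost climbs toward "every step stalls" as the machine grows.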

The challenge is doing stuff in between: you may have a flock with excess 
capacity (just as ECC memory might have 1.5N physical storage bits to be used 
to store N bits), but how do you automatically distribute the resources to be 
failure tolerant?  The original post in the thread points out that MPI is not 
a particularly facile tool for doing this.  But I’m not sure that there is a 
tool, and I’m not sure that MPI is the root of the lack of tools.  I think 
it’s that moving away from close to the metal is a “hard problem” to do in a 
generic way.  (The issues about 32 bit counts are valid, though.)


James Lux, P.E.
Task Manager, DHFR Space Testbed
Jet Propulsion Laboratory
4800 Oak Grove Drive, MS 161-213
Pasadena CA 91109
+1(818)354-2075
+1(818)395-2714 (cell)

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
