On Fri, 21 Sep 2012, Lux, Jim (337C) wrote:
On 9/21/12 9:21 AM, "Hearns, John" <john.hea...@mclaren.com> wrote:
Or, if you can factor your computation to make use of extra processing
nodes, you can just keep on moving. Think of this as a higher level
scheme than, say, Hamming codes for memory protection: use 11 bits to
store 8, and you're still synchronous.
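For anyone who wants to see the memory-protection analogy concretely, here is a minimal sketch of a textbook single-error-correcting Hamming code in Python. The function names are made up for illustration. Note that the textbook construction needs 4 check bits for 8 data bits (12 in total; 11 bits covers 7 data bits), so the 8-in-11 figure above is best read as a round number.

```python
def hamming_encode(data_bits):
    """Encode 0/1 data bits; check bits sit at 1-indexed power-of-two positions."""
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:  # smallest r with 2**r >= m + r + 1
        r += 1
    n = m + r
    code = [0] * (n + 1)  # index 0 unused so positions are 1-indexed
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two, so a data position
            code[pos] = next(it)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity  # even parity over every position with bit p set
    return code[1:]


def hamming_correct(codeword):
    """Return (corrected codeword, 1-indexed error position or 0 if clean)."""
    syndrome = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            syndrome ^= pos
    fixed = list(codeword)
    if 0 < syndrome <= len(fixed):
        fixed[syndrome - 1] ^= 1  # flip the single bad bit back
    return fixed, syndrome


def hamming_extract(codeword):
    """Drop the check bits and return only the data bits."""
    return [b for pos, b in enumerate(codeword, start=1) if pos & (pos - 1)]
```

Flipping any one bit of an encoded word and passing it through hamming_correct gives back the original codeword, with no retry or resend, which is the "still synchronous" property. A second flipped bit in the same word defeats this particular code.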
Jim, you are smarter than me!
I was going to air the idea of pairs of nodes in lock-step, with either
node being able to STONITH the other if
either there is a machine check event, or the other node does not keep up
with reporting results.
Then signal to the cluster management that "There's been a failure here -
but let's keep trucking to the end of the run,
when you can come along and replace my buddy and me."
The obvious drawback being you get half an exaflop for your money!
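A toy simulation of the buddy-pair idea, as a sketch only: the Node class, the fail_at knob and run_pair are all hypothetical names, the "computation" is a stand-in, and a node that reports nothing for a step stands in for both a machine check and a node that fails to keep up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    fail_at: Optional[int] = None  # hypothetical: step at which this node dies
    alive: bool = True

    def step(self, i, state):
        if self.fail_at is not None and i >= self.fail_at:
            return None  # nothing reported: machine check or stall
        return state + i  # stand-in for the real work


def run_pair(a, b, steps):
    """Run both nodes in lock-step; fence a node that stops reporting."""
    state = 0
    events = []
    for i in range(steps):
        results = {}
        for node in (a, b):
            if node.alive:
                results[node.name] = node.step(i, state)
        for node in (a, b):
            if node.alive and results[node.name] is None:
                node.alive = False  # the buddy shoots it (STONITH)
                events.append(f"step {i}: {node.name} fenced")
        good = [r for r in results.values() if r is not None]
        if not good:
            raise RuntimeError("both nodes of the pair failed")
        state = good[0]  # the survivor's result carries the run forward
    return state, events
```

With a healthy node paired with one that dies at step 3, run_pair(Node("a"), Node("b", fail_at=3), 6) finishes the run on the survivor and records one fencing event for the cluster manager. The sketch does not cover the case where both nodes report but disagree; a pair alone cannot tell which one is wrong.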
I was assuming that you'd figure out a Hamming-esque way to get 8/11ths of
an exaflop for an exaflop's worth of horsepower.
Hm, yeah, probably not happening... as the intermediate step of
computing the encoding is likely to be a more difficult problem by far
than what the cluster is actually working on...;-)
rgb
It might actually be an ok trade without the future "Hearns Code",
though.. Can you get computers with double the failure rate for less than
half the cost (all in, capex and opex)? Given that we are inevitably
moving this way, maybe "design for perfect" isn't an appropriate paradigm.
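A back-of-the-envelope version of that trade, under assumptions that are mine and not from the thread: node failures are independent, a pair loses the run only if both of its nodes fail during it, and the probabilities and costs are made-up inputs.

```python
def compare(p_good, cost_good, cheap_cost_ratio):
    """Compare one good node against a lock-step pair of cheap nodes.

    p_good: probability a good node fails during a run (assumed input)
    cheap_cost_ratio: all-in cost of a cheap node relative to a good one
    """
    p_cheap = 2 * p_good  # "double the failure rate"
    cost_cheap = cheap_cost_ratio * cost_good
    return {
        "single_good": {"cost": cost_good, "p_loss": p_good},
        # the pair loses the run only if both nodes fail (independence assumed)
        "cheap_pair": {"cost": 2 * cost_cheap, "p_loss": p_cheap ** 2},
    }
```

For example, compare(0.01, 1.0, 0.45) puts the cheap pair at about 0.9 of the cost with a loss probability of about 0.0004 against 0.01 for the single good node. Correlated failures (shared power, shared rack) would erode that advantage.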
In the space biz, this is a HUGE issue.. For all we spend trying to make
perfect, we don't, so is it time to bite the bullet and "design for
failure"... I think it is, but, there are those with beards grayer than
mine (and mine has a fair amount of gray in it) who don't.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:r...@phy.duke.edu