On 9/23/12 6:57 AM, "Andrew Holway" <andrew.hol...@gmail.com> wrote:
>2012/9/21 David N. Lombard <dnlom...@ichips.intel.com>:
>> Our primary approach today is recovery-based resilience, a.k.a.,
>> checkpoint-restart (C/R). I'm not convinced we can continue to rely on
>> that at exascale.
>
>- Snapshotting seems to be an ugly and inelegant way of solving the
>problem. For me it is especially laughable considering the general
>crappiness of academic codes. It pushes too much onus on the users who,
>let's face it, are great at science but generally suck at the art of
>coding :). That said, maybe there will be some kind of super elegant
>snapshotting library that makes it all work really well. But I doubt it
>will be universally sexy and, to my ear, it sounds like it would bind us
>to a particular coding paradigm. I might be completely getting the wrong
>end of the stick, however.

Snapshot/checkpoint *is* a brute-force way, particularly for dealing with
hardware failures. We used to do it to deal with power interruptions on
exhaustive search algorithms that took days. But it might be the only way
to do an "algorithm-blind" approach (a toy sketch of what I mean is at the
end of this mail).

>2012/9/22 Lux, Jim (337C) <james.p....@jpl.nasa.gov>:
>> But isn't that basically the old multiport memory or crossbar switch
>> kind of thing? (Giant memory shared by multiple processors.)
>>
>> Aside from things like cache coherency, it has scalability problems
>> (for physical-distance reasons: propagation time, if nothing else).
>
>- Agreed. Doing distributed memory where processor 1 tries to access the
>memory of processor 1000, which might be several tens of meters away,
>would (I think) be a non-starter because of the propagation and signaling
>rate versus distance problem. The beardy gods gave us MPI for this :)

The problem (such as it is) is that devising computational algorithms that
are aware of (or better, make use of) propagation delays is *hard*. Think
about the old days (before my time) when people used to optimize for
placement on the drum. That was an easy problem; dealing with errors is
not. The Feynman story of running simulations on punch-card equipment with
different colored cards is an ad hoc, specialized solution. Maybe a
similar one is optimizing for vector machines and pipelined processing, or
systolic arrays. Systolic array approaches definitely can deal with the
speed-of-light problem: latency through the system is longer than
1/(computation rate), but it's hard to find a generalized approach.
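To make that latency-versus-throughput point concrete, here is a toy 1-D
systolic FIR filter in plain C. It is purely illustrative; all the names
and numbers (NTAPS, coeff, etc.) are made up, not from any real code. Each
cell talks only to its immediate neighbour, so the first result takes
NTAPS ticks to emerge, but once the pipe is full a finished result pops
out every tick: latency longer than 1/(computation rate), throughput
unchanged. At roughly 5 ns per meter of cable, that is the same trick that
lets you keep computing while signals crawl across tens of meters of
machine room.

/* Toy 1-D systolic FIR filter: weights stay put, samples and partial
 * sums march through neighbouring cells one hop per "clock tick".
 * All names and numbers are illustrative.
 */
#include <stdio.h>
#include <string.h>

#define NTAPS    4
#define NSAMPLES 12

int main(void)
{
    const double coeff[NTAPS] = { 0.5, 0.25, 0.15, 0.10 };
    double xa[NTAPS] = { 0 }, xb[NTAPS] = { 0 };  /* two x-delays per cell */
    double yr[NTAPS] = { 0 };                     /* one y-delay per cell  */

    double input[NSAMPLES];
    for (int i = 0; i < NSAMPLES; i++) input[i] = (double)(i + 1);

    for (int t = 0; t < NSAMPLES + NTAPS; t++) {
        double x_in = (t < NSAMPLES) ? input[t] : 0.0;

        /* Snapshot last tick's registers so every cell updates "at once". */
        double xa_p[NTAPS], xb_p[NTAPS], yr_p[NTAPS];
        memcpy(xa_p, xa, sizeof xa);
        memcpy(xb_p, xb, sizeof xb);
        memcpy(yr_p, yr, sizeof yr);

        for (int c = 0; c < NTAPS; c++) {
            xa[c] = (c == 0) ? x_in : xb_p[c - 1];  /* sample hops in       */
            xb[c] = xa_p[c];                        /* ...and hops again    */
            yr[c] = ((c == 0) ? 0.0 : yr_p[c - 1])  /* partial sum hops once */
                    + coeff[c] * xb[c];
        }

        /* First result only after NTAPS ticks of latency, then one per tick. */
        if (t >= NTAPS)
            printf("tick %2d: y[%d] = %g\n", t, t - NTAPS, yr[NTAPS - 1]);
    }
    return 0;
}

The two x-delays per cell against one y-delay per cell are what keep each
partial sum lined up with the right sample; that kind of retiming is
exactly the sort of thing that is hard to do in a generalized way.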
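And here is the "algorithm-blind" checkpoint/restart sketch I promised
above. Again, strictly a toy with made-up names (state_t, checkpoint.dat,
CKPT_INTERVAL) and none of the real-world concerns such as parallel I/O,
torn files, or versioning. The point is just that the computation itself
never has to know anything about failures: you periodically dump the
whole state, and on restart you pick up from the last dump.

/* Toy application-level checkpoint/restart. All names are made up. */
#include <stdio.h>

#define NSTEPS        1000000L   /* total work, arbitrary               */
#define CKPT_INTERVAL 1000L      /* steps between snapshots, arbitrary  */
#define CKPT_FILE     "checkpoint.dat"

typedef struct {
    long   step;                 /* how far we have gotten              */
    double data[1024];           /* whatever the computation carries    */
} state_t;

/* Dump the whole state; the algorithm itself never knows or cares. */
static void save_checkpoint(const state_t *s)
{
    FILE *f = fopen(CKPT_FILE, "wb");
    if (f) { fwrite(s, sizeof *s, 1, f); fclose(f); }
}

/* Try to resume from the last snapshot; returns 1 on success. */
static int load_checkpoint(state_t *s)
{
    FILE *f = fopen(CKPT_FILE, "rb");
    if (!f) return 0;
    int ok = (fread(s, sizeof *s, 1, f) == 1);
    fclose(f);
    return ok;
}

int main(void)
{
    state_t s = { 0 };
    if (!load_checkpoint(&s))    /* cold start if there is no snapshot  */
        s.step = 0;

    for (; s.step < NSTEPS; s.step++) {
        if (s.step % CKPT_INTERVAL == 0)
            save_checkpoint(&s); /* steps 0..step-1 are safely on disk  */

        /* ... one step of the long-running computation updates s.data ... */
        s.data[s.step % 1024] += 1.0;
    }
    save_checkpoint(&s);         /* final state                         */
    return 0;
}

On a real cluster you would dump to a parallel file system and write to a
temporary file and rename it to avoid half-written snapshots, but the
brute-force shape is the same.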