2012/9/21 David N. Lombard <dnlom...@ichips.intel.com>:
> Our primary approach today is recovery-based resilience, a.k.a.,
> checkpoint-restart (C/R). I'm not convinced we can continue to rely on that
> at exascale.
- Snapshotting seems to be an ugly and inelegant way of solving the problem. For me it is especially laughable considering the general crappiness of academic codes. It pushes too much onus onto the users who, let's face it, are great at science but generally suck at the art of coding :). That said, maybe there will be some kind of super elegant snapshotting library that makes it all work really well. But I doubt it will be universally sexy and, to my ear, it sounds like it would bind us to a particular coding paradigm. I might have completely got the wrong end of the stick, however.

2012/9/22 Lux, Jim (337C) <james.p....@jpl.nasa.gov>:
> But isn't that basically the old multiport memory or crossbar switch kind
> of thing? (Giant memory shared by multiple processors).
>
> Aside from things like cache coherency, it has scalability problems (from
> physical distance reasons: propagation time, if nothing else)

- Agreed. Doing distributed memory where processor 1 tries to access the memory of processor 1000, which might be several tens of meters away, would (I think) be a non-starter because of the propagation and signaling rate versus distance problem. The beardy gods gave us MPI for this :)

I started a new thread on RAIM. It does look a bit crossbar, I'll grant you :)

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf