Re: [Beowulf] Checkpointing using flash

Andrew Holway Sun, 23 Sep 2012 06:57:49 -0700

2012/9/21 David N. Lombard <dnlom...@ichips.intel.com>:
> Our primary approach today is recovery-based resilience, a.k.a.,
> checkpoint-restart (C/R). I'm not convinced we can continue to rely on that
> at exascale.


- Snapshotting seems an ugly and inelegant way of solving the
problem. For me it is especially laughable given the general
crappiness of academic codes. It pushes too much onus onto the
users who, let's face it, are great at science but generally suck at
the art of coding :). That said, maybe there will be some kind of
super elegant snapshotting library that makes it all work really well.
But I doubt it will be universally sexy and, to my ear, it sounds like it
would bind us to a particular coding paradigm. I might have completely
got the wrong end of the stick, however.

2012/9/22 Lux, Jim (337C) <james.p....@jpl.nasa.gov>:
> But isn't that basically the old multiport memory or crossbar switch kind
> of thing? (Giant memory shared by multiple processors).
>
> Aside from things like cache coherency, it has scalability problems (from
> physical distance reasons: propagation time, if nothing else)

- Agreed. Doing distributed memory where processor 1 tries to access
the memory of processor 1000, which might be several tens of meters
away, would (I think) be a non-starter because of the propagation and
signaling rate versus distance problem. The beardy gods gave us MPI
for this :)
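
To put a rough number on the distance problem, here is a back-of-envelope
sketch. The 0.66 velocity factor is my assumption for a typical cable, and
the function name is made up for illustration. It counts only propagation
time and ignores switching, serialization and protocol overhead, so real
latencies would be worse.

```python
# Rough propagation-delay estimate; the cable figure is an illustrative assumption.
C_VACUUM_M_PER_NS = 0.299792458  # speed of light in vacuum, metres per nanosecond
VELOCITY_FACTOR = 0.66           # assumed signal speed in cable as a fraction of c


def round_trip_ns(distance_m, velocity_factor=VELOCITY_FACTOR):
    """Round-trip propagation time in nanoseconds over distance_m of cable."""
    one_way_ns = distance_m / (C_VACUUM_M_PER_NS * velocity_factor)
    return 2 * one_way_ns


if __name__ == "__main__":
    # Compare a motherboard-scale hop with a machine-room-scale hop.
    for d in (0.1, 1, 10, 30):
        print(f"{d:>5} m: {round_trip_ns(d):7.1f} ns round trip")
```

At 30 m that comes to about 300 ns per round trip from propagation alone,
which is already several times what a local DRAM access typically costs
(on the order of 100 ns), before any switch has touched the request.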

I started a new thread on RAIM. It does look a bit crossbar-ish, I'll grant you :)
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
