On Fri, Sep 21, 2012 at 01:09:41PM -0400, Ellis H. Wilson III wrote:
> On 09/21/12 12:58, Lux, Jim (337C) wrote:
> > Yes.. If that's the frequency of checkpoints. I was thinking more like 1
> > checkpoint per second or 10 seconds.
>
> While I suppose they might exist that frequent somehow in the wild, I've
> never heard of checkpoints at that low of time interval. These huge
> cluster checkpoints are near to the entire memories, so even today we're
> talking near to 64 or 128 GB of RAM per node. In ten years we're

Exascale will likely be ARM-like SoCs with stacked memories, including
nonvolatile ones (phase change, spintronics, whatever). At >100 GByte/s
memory bandwidth you can snapshot at ~Hz without too much of a penalty.

> talking what, near to if not above a TB of RAM per node? Moreover, they

I'd rather have MB/node or less.

> all tend to write their checkpoint at the same time and the SSDs aren't
> on the compute nodes -- they're on some intermediate I/O storage nodes

The forthcoming ARM SoCs typically have an mSATA SSD at each node.

> (akin to BlueGene's intermediate layer). So were talking about huge
> cluster-wide dumps of data to the flash intermediate layer, which then
> takes some hours to dump that data down to the more persistent HDDs.
> This takes at the very least many minutes, and in the normal case hours.
> I would not be surprised if the best they could do at exascale was one

Exascale won't look like today's clusters. Can't look like today's clusters.

> checkpoint a day. Again, I don't think these are used as the front-line
> of defense against failures. That would really suck :D.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf