On 09/21/12 12:58, Lux, Jim (337C) wrote:
> Yes.. If that's the frequency of checkpoints. I was thinking more like 1
> checkpoint per second or 10 seconds.

While I suppose checkpoints that frequent might exist somewhere in the wild, I've never heard of checkpointing at that short an interval. These huge cluster checkpoints are close to the entire memory of each node, so even today we're talking about roughly 64 or 128 GB of RAM per node. In ten years we're talking what, close to if not above a TB of RAM per node?

Moreover, the nodes all tend to write their checkpoints at the same time, and the SSDs aren't on the compute nodes; they're on intermediate I/O storage nodes (akin to BlueGene's intermediate layer). So we're talking about huge cluster-wide dumps of data to the flash intermediate layer, which then takes some hours to dump that data down to the more persistent HDDs. This takes many minutes at the very least, and hours in the normal case. I would not be surprised if the best they could do at exascale was one checkpoint a day.

Again, I don't think these are used as the front line of defense against failures. That would really suck :D.

Best,

ellis

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
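
To put rough numbers on the dump times discussed above, here is a back-of-envelope sketch. Only the 128 GB of RAM per node comes from the message; the node count and both aggregate bandwidths are made-up placeholders, and the helper `dump_seconds` is a hypothetical name, so treat the output as an illustration of the arithmetic rather than a measurement of any real system.

```python
def dump_seconds(nodes, ram_gb_per_node, aggregate_gb_per_s, fraction_of_ram=1.0):
    """Seconds needed to push one cluster-wide checkpoint through a storage tier.

    Assumes every node writes at once and the tier's aggregate bandwidth is
    the only bottleneck (no contention, no compression, no metadata cost).
    """
    if aggregate_gb_per_s <= 0:
        raise ValueError("aggregate bandwidth must be positive")
    total_gb = nodes * ram_gb_per_node * fraction_of_ram
    return total_gb / aggregate_gb_per_s


# ASSUMPTIONS, not figures from the thread (except RAM per node):
NODES = 10_000          # hypothetical node count
RAM_GB = 128            # per-node RAM mentioned in the message
FLASH_GB_PER_S = 1_000  # hypothetical aggregate bandwidth into the flash tier
HDD_GB_PER_S = 100      # hypothetical aggregate bandwidth from flash down to HDD

to_flash = dump_seconds(NODES, RAM_GB, FLASH_GB_PER_S)
to_hdd = dump_seconds(NODES, RAM_GB, HDD_GB_PER_S)

print(f"compute nodes -> flash: {to_flash / 60:.1f} min")  # 21.3 min
print(f"flash -> HDD:           {to_hdd / 3600:.1f} h")    # 3.6 h
```

With these placeholder numbers the dump to flash lands in the "many minutes" range and the drain to HDD in the "hours" range, which is at least consistent with the intuition that a checkpoint every second or every ten seconds is far out of reach for full-memory dumps.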