On 09/21/12 10:49, Hearns, John wrote:
> http://www.theregister.co.uk/2012/09/21/emc_abba/
>
> Frequent checkpointing will of course be vital for exascale, given the
> MTBF of individual nodes.
>
> However how accurate is this statement:
>
> HPC jobs involving half a million compute cores ... have a series of
> checkpoints set up in their code with the entire memory state stored at
> each checkpoint in a storage node.

Are your concerns about the accuracy of this statement related to elReg's claim that they must dump "the entire memory", or to flash being used as a temporary checkpointing medium?

If the former -- note that with many, many physics and climate codes the application data dominates memory. So while it may not be technically true that the "entire memory" is dumped in the checkpoint (the OS certainly won't/shouldn't dump its own memory), it is effectively true because 90% of the memory does end up getting dumped.

For what it's worth, according to my research and discussions, flash (or some other reasonably dense medium faster than disk) is an absolute necessity for checkpointing on exascale machines. I was lucky enough to sit in on a talk last fall by Gary Grider of LANL (who, from what I understand, basically designs and signs off on the purchase of their largest clusters) and John Bent (also of LANL, now at EMC). They explained the nasty costs involved in going all-disk or all-flash. A hybrid solution was effectively the only cost-effective way to do this for them, and I expect we'll see similar trends at other labs in the near future. I don't think they were even talking about full exascale, either -- more like 100 petaflops.

Disclaimer: Possible Bias -- My research is on flash development and caching for cluster computing at PSU.

Best,

ellis
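
A rough sketch of the all-disk vs. all-flash vs. hybrid cost argument above, in Python. Every number here (checkpoint size, burst window, device capacities, bandwidths, prices) is a made-up placeholder chosen only to show the shape of the trade-off; none of them come from the LANL talk or the elReg article. The model itself is also an assumption: a tier needs enough devices to absorb a checkpoint within a time window and enough devices to hold whatever it has to store, and you pay for whichever count is larger.

```python
from dataclasses import dataclass


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division (avoids float rounding)."""
    return -(-a // b)


@dataclass(frozen=True)
class Device:
    capacity_gb: int      # usable capacity per device
    bandwidth_mb_s: int   # sustained write bandwidth per device
    price: int            # cost per device, arbitrary currency units


def devices_needed(dev: Device, data_gb: int, window_s: int, stored_gb: int) -> int:
    """Devices required to write data_gb within window_s AND hold stored_gb."""
    for_bandwidth = ceil_div(data_gb * 1000, dev.bandwidth_mb_s * window_s)
    for_capacity = ceil_div(stored_gb, dev.capacity_gb)
    return max(for_bandwidth, for_capacity)


# Hypothetical placeholder values, for illustration only.
DISK = Device(capacity_gb=4000, bandwidth_mb_s=100, price=200)
FLASH = Device(capacity_gb=500, bandwidth_mb_s=1000, price=1000)
CHECKPOINT_GB = 1_000_000   # one checkpoint (most of application memory)
BURST_WINDOW_S = 300        # time the job may stall while writing
INTERVAL_S = 3600           # time between checkpoints
RETAINED = 10               # checkpoints kept on stable storage


def compare_costs() -> dict:
    total_gb = CHECKPOINT_GB * RETAINED

    # Single tier: must take the burst and hold every retained checkpoint.
    disk_only = DISK.price * devices_needed(
        DISK, CHECKPOINT_GB, BURST_WINDOW_S, total_gb)
    flash_only = FLASH.price * devices_needed(
        FLASH, CHECKPOINT_GB, BURST_WINDOW_S, total_gb)

    # Hybrid: flash absorbs the burst and holds one checkpoint; disk only
    # has to drain it before the next checkpoint, and holds all of them.
    hybrid = (
        FLASH.price * devices_needed(
            FLASH, CHECKPOINT_GB, BURST_WINDOW_S, CHECKPOINT_GB)
        + DISK.price * devices_needed(
            DISK, CHECKPOINT_GB, INTERVAL_S, total_gb)
    )
    return {"disk_only": disk_only, "flash_only": flash_only, "hybrid": hybrid}


if __name__ == "__main__":
    for name, cost in compare_costs().items():
        print(f"{name:>10}: {cost:,}")
```

With these placeholder inputs, the all-disk tier is sized by bandwidth (far more spindles than the capacity requires) and the all-flash tier is sized by capacity (far more flash than the bandwidth requires), while the hybrid buys flash for bandwidth and disk for capacity. Different inputs can of course change the ranking; the sketch only illustrates why a hybrid can come out ahead.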