Of course, one might say, a well-configured HPC compute node shouldn't reach a hung state in the first place; but in practice I see a few nodes every month that can be resurrected by a simple reboot. Admittedly, these nodes are quite senile.
I think this is an interesting concept, and I don't want to dismiss it. You could imagine jobs that checkpoint often and automatically restart themselves from a checkpoint when a machine fails like this. My philosophy, though, would be to leave a machine down until the cause of the crash is established. Now that you have IPMI and serial consoles, you should be looking at the IPMI event logs and /var/log/mcelog for uncorrectable ECC errors, and enabling crash dumps and the magic SysRq keys.

Any cluster should be designed with a few extra nodes, which will normally sit idle but take up the slack when one or two nodes are off on the Pat and Mick. However, this doesn't help when a large parallel run is brought down by a single failed node; the advice here is to checkpoint the jobs often.
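
To be concrete about the checkpoint-often idea, here is a minimal Python sketch of a job that periodically writes a checkpoint and resumes from it transparently after a restart. The checkpoint filename, the step counts, and the toy "state" are illustrative assumptions, not anyone's actual job script; a real code would checkpoint its own data structures and probably sit under a scheduler that requeues the job.

    # Minimal sketch of checkpoint-often / restart-from-checkpoint logic.
    # CHECKPOINT, STEPS and the toy "state" are illustrative assumptions.
    import os
    import pickle
    import tempfile

    CHECKPOINT = "checkpoint.pkl"   # hypothetical path on a shared filesystem
    STEPS = 1_000_000
    CHECKPOINT_EVERY = 10_000

    def save_checkpoint(step, state):
        # Write atomically so a node crash mid-write cannot corrupt the file.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(CHECKPOINT)))
        with os.fdopen(fd, "wb") as f:
            pickle.dump({"step": step, "state": state}, f)
        os.replace(tmp, CHECKPOINT)

    def load_checkpoint():
        # Resume from the last checkpoint if one exists, else start fresh.
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                data = pickle.load(f)
            return data["step"], data["state"]
        return 0, {"sum": 0.0}      # placeholder for the real simulation state

    def advance(state, step):
        state["sum"] += step        # placeholder for one unit of real work
        return state

    if __name__ == "__main__":
        step, state = load_checkpoint()
        while step < STEPS:
            state = advance(state, step)
            step += 1
            if step % CHECKPOINT_EVERY == 0:
                save_checkpoint(step, state)
        save_checkpoint(step, state)

If the node dies, rerunning the same script (by hand or via the scheduler) simply picks up from the last completed checkpoint rather than from step zero.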
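
On the log-checking side, here is a rough sketch of scanning /var/log/mcelog for lines that look like uncorrectable errors. The log path and the keywords are assumptions, since mcelog output varies between kernel and mcelog versions, so treat any hit as a cue to read the raw log and the IPMI SEL yourself rather than as a diagnosis.

    # Rough sketch: flag possible uncorrectable ECC errors in /var/log/mcelog.
    # The log path and keywords are assumptions; real mcelog wording varies.
    MCELOG = "/var/log/mcelog"
    KEYWORDS = ("uncorrected", "fatal")   # assumed severity wording

    def suspicious_lines(path=MCELOG):
        try:
            with open(path, errors="replace") as f:
                for lineno, line in enumerate(f, 1):
                    if any(k in line.lower() for k in KEYWORDS):
                        yield lineno, line.rstrip()
        except FileNotFoundError:
            return   # node may not log MCEs to this path at all

    if __name__ == "__main__":
        hits = list(suspicious_lines())
        if hits:
            print(f"{len(hits)} suspicious line(s) in {MCELOG}:")
            for lineno, line in hits:
                print(f"  {lineno}: {line}")
        else:
            print(f"No obvious uncorrectable-error lines in {MCELOG} (or file missing).")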