On Tue, 30 Sep 2008, Jon Forrest wrote:

> The trouble with rebooting nodes is that this takes human energy.

When using a queueing system, rebooting nodes can be automated easily:
- the node to be rebooted is switched to "offline" state so that the scheduler doesn't attempt to start new jobs on it
- wait until the currently running job finishes
- reboot
- put the node back "online" so that the scheduler can again start jobs on it

All the steps except the reboot itself are interactions with the queueing system and can happen on the frontend/master node only. The reboot step requires some interaction with the node, either remote shell access to run /sbin/reboot or some other way to restart it (IPMI, remote power management, etc.)
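
The four steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested procedure: the message does not name a queueing system, so the sketch assumes Torque/PBS, where `pbsnodes -o` marks a node offline and `pbsnodes -c` clears that state, and it assumes ssh as the remote shell. The function `rolling_reboot` and the callables `node_is_idle` and `node_is_up` are hypothetical names; how to check them depends on the site.

```python
import subprocess
import time


def rolling_reboot(node, node_is_idle, node_is_up,
                   run=subprocess.check_call, poll_seconds=60,
                   sleep=time.sleep):
    """Reboot one compute node without disturbing running jobs.

    node_is_idle and node_is_up are hypothetical site-specific checks that
    take the node name and return True or False.
    """
    # Step 1: mark the node offline so the scheduler starts no new jobs on it.
    # Assumption: Torque/PBS; other queueing systems use other commands.
    run(["pbsnodes", "-o", node])

    # Step 2: wait until the currently running job has finished.
    while not node_is_idle(node):
        sleep(poll_seconds)

    # Step 3: the only step that needs access to the node itself.
    # Assumption: remote shell access; IPMI or remote power management
    # could be used here instead.
    run(["ssh", node, "/sbin/reboot"])

    # Wait for the node to come back. A real check should confirm that the
    # node actually went down first, e.g. by comparing boot times.
    while not node_is_up(node):
        sleep(poll_seconds)

    # Step 4: put the node back online so the scheduler can use it again.
    run(["pbsnodes", "-c", node])
```

Everything except the ssh call runs against the queueing system, so the whole function can run on the frontend/master node. Passing `run` and `sleep` as parameters makes it possible to try the sequence with stand-ins before pointing it at a real cluster.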

> It's easier to keep nodes up as long as possible

With the increasing number of nodes in clusters these days, the overall failure rate also increases. It's much easier to deal with failures when they are not treated as a catastrophe of the "cross my fingers and hope that the node comes up properly and everything still works" kind, but rather as a routine matter of nodes simply going up and down.

> This is a good idea. Can you write more about this?

The e-mail from Brian Oborn described the principle in a few words, probably better than I could have done myself. If you want more details, ask more precise questions and I guess that any of us could answer.

--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8869/8240, Fax: +49 6221 54 8868/8850
E-mail: [EMAIL PROTECTED]
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
