Re: [Beowulf] Compute Node OS on Local Disk vs. Ram Disk

Bogdan Costescu Tue, 30 Sep 2008 12:18:05 -0700

On Tue, 30 Sep 2008, Jon Forrest wrote:

The trouble with rebooting nodes is that this takes human energy.

When using a queueing system, rebooting nodes can be automated easily:

- the node to be rebooted is switched to "offline" state so that the scheduler doesn't attempt to start new jobs on it
- wait until the currently running job finishes
- reboot
- put the node back "online" so that the scheduler can again start jobs on it

All the steps except the reboot itself are interactions with the queueing system and can happen on the frontend/master node only. The reboot step requires some interaction with the node, either remote shell access to run /sbin/reboot or some other way to restart it (IPMI, remote power management, etc.)
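
A rough sketch of that sequence, meant to run on the frontend, could look like the code below. It is only an illustration under stated assumptions: the pbsnodes -o / pbsnodes -c commands assume a Torque/PBS-style queueing system, the reboot goes through ssh, and reboot_node, node_is_busy and node_is_up are hypothetical names. The two checks are passed in as functions because how to ask "is a job still running there?" and "is the node back?" depends entirely on the local setup.

```python
import subprocess
import time

# Assumed command templates; replace them with whatever your queueing
# system and remote access method actually provide.
OFFLINE_CMD = ["pbsnodes", "-o", "{node}"]      # assumption: Torque/PBS-style
ONLINE_CMD = ["pbsnodes", "-c", "{node}"]       # assumption: Torque/PBS-style
REBOOT_CMD = ["ssh", "{node}", "/sbin/reboot"]  # could be IPMI instead


def fill(template, node):
    """Substitute the node name into a command template."""
    return [part.format(node=node) for part in template]


def run_command(argv):
    """Run a command on the frontend and return its exit status."""
    return subprocess.run(argv).returncode


def reboot_node(node, node_is_busy, node_is_up, run=run_command,
                sleep=time.sleep, poll_seconds=60, max_polls=30):
    """Offline the node, wait for its job, reboot it, put it back online.

    node_is_busy(node) and node_is_up(node) are site-specific checks
    supplied by the caller (hypothetical helpers, not part of any scheduler).
    Returns True if the node came back and was put online again, False if
    it did not come back in time (it then simply stays offline).
    """
    # Step 1: keep the scheduler from starting new jobs on the node.
    if run(fill(OFFLINE_CMD, node)) != 0:
        raise RuntimeError("could not mark %s offline" % node)

    # Step 2: wait until the currently running job finishes.
    while node_is_busy(node):
        sleep(poll_seconds)

    # Step 3: reboot. The exit status is ignored because the remote
    # shell connection usually drops while the node goes down.
    run(fill(REBOOT_CMD, node))

    # Step 4: once the node answers again, let the scheduler use it.
    for _ in range(max_polls):
        sleep(poll_seconds)
        if node_is_up(node):
            if run(fill(ONLINE_CMD, node)) != 0:
                raise RuntimeError("could not mark %s online" % node)
            return True
    return False
```

A node that does not come back is just left "offline", so the scheduler keeps ignoring it until somebody has time to look at it.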

It's easier to keep nodes up as long as possible

With the increasing number of nodes in clusters these days, the overall failure rate also increases. It's much easier to deal with failures when they are not seen as a catastrophe of the "twist my fingers and hope that the node is coming up properly and everything still works" kind, but rather as nodes simply going up and down.

This is a good idea. Can you write more about this?

The e-mail from Brian Oborn described the principle in a few words, probably better than I could have done it myself. If you want more details, ask more precise questions and I guess that any of us could answer.


--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8869/8240, Fax: +49 6221 54 8868/8850
E-mail: [EMAIL PROTECTED]
