On Mon, 22 Sep 2008, Joe Landman wrote:

Prentice Bisbal wrote:
The more services you run on your cluster nodes (gmond, sendmail, etc.),
the less performance is available for number crunching, but turning
services off makes administration harder. For example, if you turn off
postfix/sendmail, you'll no longer get automated e-mails from your
system alerting you to a problem.

Does every node need to be running sendmail/postfix? In most cases, nodes should be fairly "dumb", in the sense of having as little as possible actively running. They need little more than an authentication service, a login/process start service, and a disk service (NFS, panfs, glusterfs, ...).
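
One way to hold nodes to that standard is to compare what is actually running against a short allowlist. A minimal sketch in Python follows; the names in ALLOWED and the sample input are illustrative assumptions, not a recommended set, and how you collect the list of running services depends on the node's init system.

```python
# Hypothetical allowlist: adjust to whatever your site's nodes really need.
ALLOWED = {"sshd", "nfs-client", "nslcd"}


def unexpected_services(running, allowed=ALLOWED):
    """Return the sorted names of running services that are not on the allowlist."""
    return sorted(set(running) - set(allowed))


if __name__ == "__main__":
    # Sample input only; in practice this comes from the node itself.
    sample = ["sshd", "sendmail", "gmond", "nfs-client"]
    print(unexpected_services(sample))
```

Anything the function returns is a candidate for being switched off, or for an explicit decision to keep it.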

One can always run xmlsysd instead, which is a very lightweight
on-demand information service.  It costs you, basically, a socket, and
you can poll the nodes to get their current runstate every five seconds,
every thirty seconds, every minute, every five minutes.  Pick a
granularity that drops its impact on a running computation to a level
you consider tolerable, while still providing you with node-level state
information when you need it.
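
The polling pattern described above can be sketched in Python roughly as follows. This is not xmlsysd's own client: fetch_state is a generic TCP request/reply helper that assumes the daemon closes the connection after replying, and the port and request string are left as parameters because they depend on the daemon you actually run. poll_nodes and its arguments are names made up for the sketch.

```python
import socket
import time


def fetch_state(host, port, request, timeout=2.0):
    """Send one request over TCP and return the reply, read until the peer closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def poll_nodes(nodes, fetch, interval, rounds, sleep=time.sleep):
    """Poll every node once per round, pausing `interval` seconds between rounds.

    `fetch(node)` returns that node's state. A node that raises OSError
    (refused connection, timeout) is recorded as None, so one dead node
    does not stop the sweep.
    """
    history = []
    for i in range(rounds):
        snapshot = {}
        for node in nodes:
            try:
                snapshot[node] = fetch(node)
            except OSError:
                snapshot[node] = None
        history.append(snapshot)
        if i < rounds - 1:
            sleep(interval)
    return history
```

The interval argument is the granularity knob: raising it from 5 to 300 seconds cuts the number of connections each node has to service by a factor of 60, at the price of staler state.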

Just a thought...;-)

   rgb


My question is this: how extreme do you go in disabling non-essential
services on your cluster nodes? Do you turn off *everything* that's not
absolutely necessary, or do you leave some things running to make
administration easier?

As long as you have an ssh portal in as root, you should be fine for admin. From an admin point of view, though, as you scale up the number of nodes you want the admin load to remain constant, that is, not to scale with node count. Moreover, you want to actively reduce the number of moving parts as you scale up, since moving parts tend to break. These are things like installs or images. We have customers who occasionally (against our advice) test the limits of their "cluster installer". What is interesting is that they can't *successfully* install/image more than about 20-24 nodes at a time. Yes, they can install more than that, but the systems they install that way seem to have problems that go away at the next reload.

Basically, as you scale up the system, you want to scale down, if not completely eliminate, node-level admin. You definitely don't want the nodes spending cycles (and therefore power, time, resources) on things they really ought not to be doing.
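
To keep admin effort from growing with node count, the usual trick is to issue one command from the head node and fan it out over ssh. A rough Python sketch, assuming key-based root ssh access to every node; run_everywhere, ssh_run and the default cap of 16 parallel sessions are choices made for this example, not anything measured.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_run(host, command, timeout=30):
    """Run one command on one node as root over ssh; return (exit code, stdout)."""
    result = subprocess.run(
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        ["ssh", "-o", "BatchMode=yes", f"root@{host}", command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout


def run_everywhere(hosts, command, run=ssh_run, max_parallel=16):
    """Run the same command on every host, at most `max_parallel` at a time."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = pool.map(lambda host: run(host, command), hosts)
        return dict(zip(hosts, results))
```

A call such as run_everywhere(node_names, "uptime") then costs the admin the same effort for 8 nodes as for 800; only max_parallel needs tuning to what the head node and network tolerate.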

Joe


I'm curious to see how everyone else has their cluster(s) configured.





--
Robert G. Brown                            Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977