Gerry,

Like others, I too use ganglia, and have a custom script which reports CPU temperatures (and fan speeds) for the nodes. However, I changed ganglia's default method of communication (multicast) to reduce the chatter. I use a unicast setup, where each node reports directly to the monitoring server, which is a dedicated machine for monitoring all the systems and performing other tasks (DHCP, NTP, imaging, etc.).
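
The script is basically a cron job wrapped around gmetric, and the unicast change is just a couple of stanzas in gmond.conf on each node. A rough sketch of both follows; the monitoring-server address, port, and sysfs paths are assumptions for illustration, not exactly what I run.

  # gmond.conf: send everything straight to the monitoring server
  # (the stock config uses mcast_join lines here instead)
  udp_send_channel {
    host = 192.168.1.10   # hypothetical address of the monitoring server
    port = 8649
  }

  # only the monitoring server actually needs to listen
  udp_recv_channel {
    port = 8649
  }

And the temperature reporter, more or less:

  #!/usr/bin/env python
  # Rough sketch: read hwmon temperatures from sysfs and push them to
  # ganglia as extra metrics via gmetric. Paths and metric names are
  # illustrative; coretemp/fan readings may live elsewhere on your kernel.
  import glob
  import subprocess

  for path in glob.glob("/sys/class/hwmon/hwmon*/temp*_input"):
      with open(path) as f:
          celsius = int(f.read().strip()) / 1000.0   # sysfs reports millidegrees
      label = path.split("/")[-1].replace("_input", "")   # e.g. "temp1"
      subprocess.check_call([
          "gmetric",
          "--name",  "cpu_%s" % label,
          "--value", "%.1f" % celsius,
          "--type",  "float",
          "--units", "Celsius",
      ])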

Each node uses less than 1 KB/s to transmit all the ganglia information, including my extra metrics. For the useful historical record you get from this data, it's worth the rather small amount of network chatter. You can tune the metrics further, turn off the ones you don't want, or have them report less often.
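
The tuning is also done in gmond.conf, roughly like this (a sketch, with made-up numbers):

  # collect load_one once a minute, and only resend it after five minutes
  # (or sooner, if it changes by more than value_threshold)
  collection_group {
    collect_every = 60
    time_threshold = 300
    metric {
      name = "load_one"
      value_threshold = "1.0"
    }
  }

Dropping the collection_group entries for metrics you don't care about (or stretching the thresholds) quiets things down further.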

I'd suggest installing it; if you still think it is too chatty, remove it and look for another option. I find it useful in that you can see when a node died, what the load on it was when it crashed, what the network traffic looked like, etc.

I also use cacti, but only for the head servers, switches, etc. I find it has too much overhead for the nodes. It is, however, useful in that it can send email alerts about problems and allows graphing of SNMP devices.

Craig.

Gerry Creager wrote:
Now, for the flame-bait. Bernard suggests cacti and/or ganglia to handle this. Our group has heard some mutterings that ganglia is a "chatty" application and could cause some hits on our 1 GbE interconnect fabric.

A little background on our current implementation: 126 dual quad-core Xeon Dell 1950s interconnected with gigabit Ethernet. No, it's not the world's best MPI machine, but it should... and does... perform admirably for throughput applications where most jobs can run on a node (or two) but don't use MPI as much as, e.g., OpenMP, or in some cases even run on a single core but use all the RAM.

So, we're worried a bit about having everything talk on the same gigabit backplane, hence, so far, no ganglia.

What are the issues I might want to worry about in this regard, especially as we expand this cluster to more nodes (potentially going to 2k cores, or essentially doubling)?
