At my employer, we use a variety of monitoring tools for our various clusters. Our Nagios box is a VM with a single processor and 512MB of memory. Currently, we monitor 1700 hosts with three or four service checks apiece (two of which SSH to the nodes to run scripts). We check services about every 30 minutes.
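As a rough back-of-the-envelope sketch of what those numbers mean for the central box, the snippet below works out the average check rate. It assumes the checks are spread evenly over the 30-minute interval, which real scheduling won't match exactly, and the function name is purely illustrative.

```python
def checks_per_second(hosts, checks_per_host, interval_minutes):
    """Average number of checks the central server starts per second,
    assuming checks are spread evenly over the interval (an assumption)."""
    return hosts * checks_per_host / (interval_minutes * 60)


# Figures from the message: 1700 hosts, three or four checks each, every 30 minutes.
low = checks_per_second(1700, 3, 30)
high = checks_per_second(1700, 4, 30)

# Two of the checks on each host go over SSH.
ssh_rate = checks_per_second(1700, 2, 30)

print(f"all checks: {low:.1f} to {high:.1f} per second")
print(f"SSH checks: {ssh_rate:.1f} per second")
```

By that estimate the server starts roughly three to four checks per second on average, about two of them over SSH.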
The load on the central box does get high at times, but it is generally responsive and there's not much additional network load. We chose SSH-based checks because we were already running Ganglia for statistics monitoring on the nodes and no one wanted to maintain yet another daemon. It seemed like the best option for us.

Best of luck with your cluster monitoring!

Alex Younts

On Mon, Dec 22, 2008 at 8:28 PM, Rahul Nabar <rpna...@gmail.com> wrote:
> I just installed Nagios to try and monitor my 256 compute nodes
> centrally. It seems to work like a charm for all the public services
> (ping, ssh etc.) but now I was getting more ambitious and wanted to
> try to monitor the private services too (disk usage; process loads;
> torque; pbs etc.).
>
> I was just confused whether (1) to use the NRPE plugin (seems like a
> pain to deploy onto all 256 nodes) or (2) go via the check_by_ssh
> route. (I already have passwordless logins from master-nodes to
> slave-nodes)
>
> I'd like (2) because it is more secure and seems easier to deploy but
> I'm a bit afraid that this will overtax my central server.
>
> Any suggestions? Are other users using Nagios here?
>
> --
> Rahul

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf