Re: [Beowulf] Monitoring and Metrics

2017-10-07 Thread lange
> On Sat, 7 Oct 2017 08:21:08 -0400, Josh Catana said: > This may have been brought up in the past, but I couldn't find much in my message archive. > What are people using for HPC cluster monitoring and metrics lately? I've been low on time to add features to my home grown solution

Re: [Beowulf] Monitoring and Metrics

2017-10-07 Thread Lachlan Musicman
> On 10/7/2017 8:21 AM, Josh Catana wrote: > > This may have been brought up in the past, but I couldn't find much in my message archive. > > What are people using for HPC cluster monitoring and metrics lately? I've been low on time to add features to my home grown solution and looking at some

Re: [Beowulf] Monitoring and Metrics

2017-10-07 Thread Paul Edmon
So for general monitoring of cluster usage we use: https://github.com/fasrc/slurm-diamond-collector and pipe to Grafana. We also use XDMod: http://open.xdmod.org/7.0/index.html As for specific node alerting, we use the old standby of Nagios. -Paul Edmon- On 10/7/2017 8:21 AM, Josh Cat

[Beowulf] Monitoring and Metrics

2017-10-07 Thread Josh Catana
This may have been brought up in the past, but I couldn't find much in my message archive. What are people using for HPC cluster monitoring and metrics lately? I've been low on time to add features to my home grown solution and looking at some OTS products. I'm looking for something that can do mo