Re: [Beowulf] Monitoring and Metrics

Paul Edmon Sat, 07 Oct 2017 06:14:23 -0700

So for general monitoring of the cluster usage we use:


https://github.com/fasrc/slurm-diamond-collector

and pipe to Grafana.  We also use XDMod:

http://open.xdmod.org/7.0/index.html

As for specific node alerting, we use the old standby of Nagios.
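
For anyone wondering what a collector feeding a dashboard looks like in outline, here is a minimal sketch. It is not the code from slurm-diamond-collector. It assumes a Graphite-style backend that accepts plaintext "path value timestamp" lines, and it assumes the job states come from something like `squeue -h -o "%T"` (one state per line). The function names and the metric prefix `cluster.slurm.jobs` are made up for illustration.

```python
from collections import Counter
import time


def count_job_states(squeue_output: str) -> Counter:
    """Count jobs per state from text holding one state name per line.

    Assumed input: the output of `squeue -h -o "%T"`; blank lines are skipped.
    """
    states = (line.strip().lower() for line in squeue_output.splitlines())
    return Counter(s for s in states if s)


def to_graphite_lines(counts, prefix="cluster.slurm.jobs", timestamp=None):
    """Render counts as Graphite plaintext lines: '<path> <value> <timestamp>'.

    The prefix is a hypothetical metric namespace, not one used by any real site.
    """
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return [f"{prefix}.{state} {n} {ts}" for state, n in sorted(counts.items())]


# Hard-coded sample instead of calling squeue, so the sketch runs anywhere.
sample = "RUNNING\nRUNNING\nPENDING\n"
lines = to_graphite_lines(count_job_states(sample), timestamp=1507381200)
for line in lines:
    print(line)
```

In a real setup the lines would be sent to the metrics backend on a schedule, and the dashboard would graph them from there.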

-Paul Edmon-


On 10/7/2017 8:21 AM, Josh Catana wrote:

This may have been brought up in the past, but I couldn't find much in my message archive.
What are people using for HPC cluster monitoring and metrics lately?
I've been low on time to add features to my home grown solution and looking at some OTS products.
I'm looking for something that can do monitoring, alert on condition, broken hardware, etc.
Also something that does system resource utilization metrics. If it has a plug-in for a scheduling system like PBS where I can correlate a job ID to the metrics of the systems it is currently running on or previously ran on at the time, that would be an amazing plus.
Any of you beowulfers have any suggestions?


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
