> On 10/7/2017 8:21 AM, Josh Catana wrote:
>
> This may have been brought up in the past, but I couldn't find much in my
> message archive.
> What are people using for HPC cluster monitoring and metrics lately? I've
> been low on time to add features to my home grown solution and looking at
> some OTS products.
> I'm looking for something that can do monitoring, alert on condition,
> broken hardware, etc.
> Also something that does system resource utilization metrics. If it has a
> plug-in for a scheduling system like PBS where I can correlate a job ID to
> the metrics of the systems it is currently running on or previously ran on
> at the time, that would be an amazing plus.
> Any of you beowulfers have any suggestions?

We use XDMoD and Zabbix for per machine monitoring. Logwatch as well, but
not as comprehensively.
Tried Grafana, InfluxDB and this plugin
( http://slurm.schedmd.com/SLUG16/monitoring_influxdb_slug.pdf ) but we
didn't find it as useful as we would have liked. It's a great plugin, we
just didn't need it.

cheers
L.

------
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics
is the insistence that we cannot ignore the truth, nor should we panic about
it. It is a shared consciousness that our institutions have failed and our
ecosystem is collapsing, yet we are still here — and we are creative agents
who can shape our destinies. Apocalyptic civics is the conviction that the
only way out is through, and the only way through is together."

*Greg Bloom* @greggish
https://twitter.com/greggish/status/873177525903609857
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf