Re: [slurm-users] Nagios or Other Monitoring Plugins

2018-01-20 Thread Ryan Novosielski
Would be interested if anyone knows of any such scripts that already exist.

Re: [slurm-users] Nagios or Other Monitoring Plugins

2018-01-19 Thread John Hearns
Not specifically Slurm, but it can be useful to have alerts on jobs which either will never start or which are 'stalled'. You might want to have an alert on jobs which (say) request more slots or nodes than physically exist, so the user's job will never run. Or you can look for 'stalled' jobs where t
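A minimal sketch of the "job will never start" alert described above, written as a Nagios-style check in Python. It assumes pending jobs are listed with `squeue -h -t PENDING -o "%i %r"` (job ID and pending reason), and that a small set of reason codes marks a request that cannot be satisfied. The set `NEVER_RUN_REASONS` and the function names are assumptions made for illustration; whether such jobs stay pending or are rejected at submission depends on the site's configuration.

```python
import subprocess

# Reason codes treated as "will never start". This set is an assumption made
# for the sketch; adjust it to the reasons your Slurm version reports.
NEVER_RUN_REASONS = {"PartitionNodeLimit", "PartitionTimeLimit", "BadConstraints"}


def parse_pending(squeue_output):
    """Return (job_id, reason) pairs from 'squeue -h -t PENDING -o "%i %r"' text."""
    jobs = []
    for line in squeue_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # Keep only the leading reason code, dropping any detail after a comma.
            jobs.append((parts[0], parts[1].split(",")[0].strip()))
    return jobs


def check_pending(squeue_output):
    """Map pending-job reasons to a Nagios (exit_code, message) pair."""
    stuck = [(job, reason) for job, reason in parse_pending(squeue_output)
             if reason in NEVER_RUN_REASONS]
    if stuck:
        detail = ", ".join(f"{job} ({reason})" for job, reason in stuck)
        return 1, f"WARNING - {len(stuck)} job(s) may never start: {detail}"
    return 0, "OK - no pending jobs with unsatisfiable requests"


def run_squeue():
    """Collect pending jobs from the live cluster (requires squeue on PATH)."""
    result = subprocess.run(
        ["squeue", "-h", "-t", "PENDING", "-o", "%i %r"],
        capture_output=True, text=True, check=True)
    return result.stdout
```

A plugin wrapper would call `check_pending(run_squeue())`, print the message, and exit with the returned code (0 = OK, 1 = WARNING in the Nagios convention).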

Re: [slurm-users] Nagios or Other Monitoring Plugins

2018-01-18 Thread Marcin Stolarek
We're using icinga2, storing accounting data in InfluxDB for Grafana dashboards. In terms of monitoring I prefer end-user functionality, so apart from services we also have a plugin that submits a job to the cluster (to idle nodes, with a deadline of a few minutes). The job simply creates files on shared
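One way such an end-to-end probe could look, as a rough Python sketch rather than the plugin described above: a trivial job touches a marker file, and a separate check reports how old that marker is. The marker path, the age limit, and both function names are hypothetical; the sketch assumes `sbatch --parsable --wrap` is available and that the marker lives somewhere both the compute nodes and the monitoring host can see.

```python
import os
import subprocess
import time


def submit_probe(marker_path):
    """Submit a trivial job that touches marker_path; return the job ID.

    Assumes sbatch is on PATH. --parsable prints 'jobid' or 'jobid;cluster'.
    """
    result = subprocess.run(
        ["sbatch", "--parsable", "--wrap", f"touch {marker_path}"],
        capture_output=True, text=True, check=True)
    return result.stdout.strip().split(";")[0]


def marker_status(marker_path, max_age_s, now=None):
    """Return a Nagios (exit_code, message) pair based on the marker's age."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(marker_path)
    except OSError:
        # A missing or unreadable marker means no probe job has completed.
        return 2, f"CRITICAL - probe marker {marker_path} is missing"
    if age > max_age_s:
        return 2, f"CRITICAL - probe marker is {int(age)}s old (limit {max_age_s}s)"
    return 0, f"OK - probe marker is {int(age)}s old"
```

Splitting submission from verification keeps the check itself fast: the monitoring system can submit on one schedule and only test the marker's freshness on another.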

Re: [slurm-users] Nagios or Other Monitoring Plugins

2018-01-18 Thread Ryan Novosielski
> On Jan 18, 2018, at 4:34 PM, Lachlan Musicman wrote:
>
> On 19 January 2018 at 07:29, Ryan Novosielski wrote:
> Hi all,
>
> Looked back at the mailing list to see if there was a question about this
> already. There was some mention of /using/ Nagios, but no real mention of
> specifics. What

Re: [slurm-users] Nagios or Other Monitoring Plugins

2018-01-18 Thread Michael Gutteridge
We're moving to Prometheus for lots of our monitoring functions. We've got Nagios and Ganglia in place, but Prometheus and Grafana make a really nice combo for monitoring and alerting. There's even an exporter for Slurm, https://github.com/vpenso/prometheus-slurm-exporter, that includes node data

Re: [slurm-users] Nagios or Other Monitoring Plugins

2018-01-18 Thread Lachlan Musicman
On 19 January 2018 at 07:29, Ryan Novosielski wrote:
> Hi all,
>
> Looked back at the mailing list to see if there was a question about this
> already. There was some mention of /using/ Nagios, but no real mention of
> specifics. What do people monitor with Nagios? We monitor, so far,
> slurmctld

[slurm-users] Nagios or Other Monitoring Plugins

2018-01-18 Thread Ryan Novosielski
Hi all,

Looked back at the mailing list to see if there was a question about this already. There was some mention of /using/ Nagios, but no real mention of specifics. What do people monitor with Nagios? We monitor, so far, slurmctld, slurmdbd, and MySQL, but there are probably some others. Migh
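For the slurmctld part of that list, a check might be sketched along these lines in Python. It assumes `scontrol ping` is used to query the controller(s) and that its output contains the words UP or DOWN for each one; the exact wording differs between Slurm versions, so the parser only looks for those two tokens. The function names and the UP/DOWN-to-status mapping are illustrative assumptions, not an existing plugin.

```python
import re
import subprocess


def classify_ping(ping_output):
    """Map 'scontrol ping' text to a Nagios (exit_code, message) pair.

    Only the UP/DOWN tokens are inspected, since the surrounding wording
    is assumed to vary between Slurm versions.
    """
    states = re.findall(r"\b(UP|DOWN)\b", ping_output)
    if not states:
        return 3, "UNKNOWN - could not parse scontrol ping output"
    if all(state == "UP" for state in states):
        return 0, "OK - all slurmctld controllers are UP"
    if all(state == "DOWN" for state in states):
        return 2, "CRITICAL - no slurmctld controller is UP"
    # Mixed result: at least one controller responds, at least one does not.
    return 1, "WARNING - some slurmctld controllers are DOWN"


def run_ping():
    """Query the live cluster (requires scontrol on PATH)."""
    result = subprocess.run(["scontrol", "ping"], capture_output=True, text=True)
    return result.stdout
```

A wrapper would call `classify_ping(run_ping())`, print the message, and exit with the code, following the Nagios convention of 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.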