On 19 January 2018 at 07:29, Ryan Novosielski <novos...@rutgers.edu> wrote:
> Hi all,
>
> Looked back at the mailing list to see if there was a question about this
> already. There was some mention of /using/ Nagios, but no real mention of
> specifics. What do people monitor with Nagios? We monitor, so far,
> slurmctld, slurmdbd, and MySQL, but there are probably some others. Might
> be helpful to run “scontrol ping” for example, or similar, on our login
> nodes.
>
> Does anyone have any plugins they’ve written or ideas they can share?
> Nagios Exchange doesn’t have anything with SLURM anywhere in the name.
>
> Thanks!
>

Off the top of my head, the only other two that I would want explicitly
would be:

- ntp/chrony and their respective daemons (ntpd/chronyd). Nodes go offline
  when their clocks drift too far apart, especially if you are using Munge.
- authentication system - in our case ipa/sssd. Without that, even the
  queued jobs will fail.

We use Zabbix in house. I was under the impression that people were moving
toward Icinga 2 over Nagios.

Cheers
L.

------
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics
is the insistence that we cannot ignore the truth, nor should we panic
about it. It is a shared consciousness that our institutions have failed
and our ecosystem is collapsing, yet we are still here — and we are
creative agents who can shape our destinies. Apocalyptic civics is the
conviction that the only way out is through, and the only way through is
together."

*Greg Bloom* @greggish
https://twitter.com/greggish/status/873177525903609857
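
For the “scontrol ping” idea in the quoted message, here is a minimal sketch of what a Nagios-style check might look like. It is not an existing plugin. It assumes that `scontrol ping` reports each controller with the word UP or DOWN (the exact wording varies between Slurm versions, so check it against your own output), and the function names `classify` and `main` are made up for illustration. The exit codes 0/1/2/3 are the usual Nagios plugin convention for OK/WARNING/CRITICAL/UNKNOWN.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(output):
    """Map `scontrol ping` text to an (exit_code, message) pair.

    Assumption: every controller is reported with the word UP or DOWN,
    either one per line or slash-separated on a single line.
    """
    tokens = output.replace("/", " ").split()
    up, down = tokens.count("UP"), tokens.count("DOWN")
    if up == 0 and down == 0:
        return UNKNOWN, "UNKNOWN: could not parse scontrol ping output"
    if up == 0:
        return CRITICAL, "CRITICAL: no slurmctld is responding"
    if down > 0:
        return WARNING, f"WARNING: {down} slurmctld down, {up} up"
    return OK, f"OK: {up} slurmctld up"


def main():
    """Run `scontrol ping`, print one status line, return the exit code."""
    try:
        proc = subprocess.run(["scontrol", "ping"], capture_output=True,
                              text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"UNKNOWN: could not run scontrol: {exc}")
        return UNKNOWN
    code, message = classify(proc.stdout + proc.stderr)
    print(message)
    return code
```

To use it as a plugin, end the script with `import sys; sys.exit(main())` and run it on a login node (via NRPE or similar), so that it also exercises the network path from the login node to the controller. The same pattern should carry over to a clock-offset or sssd check, with only the command and the parsing changed.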