> On Jan 18, 2018, at 4:34 PM, Lachlan Musicman <data...@gmail.com> wrote:
> 
> On 19 January 2018 at 07:29, Ryan Novosielski <novos...@rutgers.edu> wrote:
> Hi all,
> 
> Looked back at the mailing list to see if there was a question about this 
> already. There was some mention of /using/ Nagios, but no real mention of 
> specifics. What do people monitor with Nagios? We monitor, so far, slurmctld, 
> slurmdbd, and MySQL, but there are probably some others. Might be helpful to 
> run “scontrol ping” for example, or similar, on our login nodes.
> 
> Does anyone have any plugins they’ve written or ideas they can share? Nagios 
> Exchange doesn’t have anything with SLURM anywhere in the name.
> 
> Thanks!
> 
> 
> Off the top of my head the only other two that I would want explicitly would 
> be:
>  - ntp/chrony and their respective daemons (ntpd/chronyd). Nodes go offline 
> when their clocks drift too far, especially if you are using Munge.
>  - authentication system - in our case ipa/sssd. Without that, even the 
> queued jobs will fail.
> 
> We use Zabbix in-house. I was under the impression that people were moving 
> toward Icinga2 over Nagios.

I wouldn’t mind moving to Icinga2 over Nagios, but really, it’s more or less a 
nicer version of the same thing, so I’d have the same question with Icinga2.
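
Since Icinga2 runs the same style of check plugin as Nagios (one status line 
on stdout, exit code 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN), a single script 
would serve either. Below is a rough sketch of a check wrapped around 
"scontrol ping". The function names are hypothetical, and the output format 
the regex expects ("Slurmctld(primary) at <host> is UP") is an assumption that 
should be verified against the Slurm version in use.

```python
import re
import subprocess

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# ASSUMED line shape: "Slurmctld(primary) at ctl1 is UP", or the older
# combined form "Slurmctld(primary/backup) at a/b are UP/DOWN".
STATE_RE = re.compile(r"\b(?:is|are)\s+((?:UP|DOWN)(?:/(?:UP|DOWN))*)\s*$")


def classify_ping_output(text):
    """Map `scontrol ping` output to an (exit_code, message) pair."""
    states = []
    for line in text.splitlines():
        match = STATE_RE.search(line)
        if match:
            states.extend(match.group(1).split("/"))
    if not states:
        return UNKNOWN, "UNKNOWN - could not parse scontrol ping output"
    up = states.count("UP")
    summary = "%d of %d slurmctld up" % (up, len(states))
    if up == len(states):
        return OK, "OK - " + summary
    if up == 0:
        return CRITICAL, "CRITICAL - " + summary
    # At least one controller answers, but not all of them.
    return WARNING, "WARNING - " + summary


def main():
    """Run `scontrol ping`, print one status line, return the exit code."""
    try:
        result = subprocess.run(
            ["scontrol", "ping"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        print("UNKNOWN - scontrol ping did not run: %s" % exc)
        return UNKNOWN
    code, message = classify_ping_output(result.stdout)
    print(message)
    return code
```

To use it as a plugin, the script would end with "sys.exit(main())" (after 
"import sys"). Run on a login node, it checks that the node can actually reach 
the controller, which a process check on the controller host itself would not 
show.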

Thanks for the NTP/Chrony tip though — if I get only that from this thread, it 
will have been worth it. That’s caused us trouble more than once. We do already 
monitor our LDAP, but SSSD is a good idea.
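
For the clock-drift case, a check along the following lines might work on 
nodes running chrony. It is only a sketch: it assumes the "System time" line 
that "chronyc tracking" prints, the function name is hypothetical, and the 
warning/critical thresholds are placeholders rather than values taken from 
Munge or Slurm.

```python
import re

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# ASSUMED line shape from `chronyc tracking`:
# "System time     : 0.000012345 seconds slow of NTP time"
OFFSET_RE = re.compile(
    r"^System time\s*:\s*(\d+(?:\.\d+)?) seconds (slow|fast) of NTP time",
    re.MULTILINE,
)


def classify_offset(tracking_text, warn=0.5, crit=5.0):
    """Map `chronyc tracking` output to an (exit_code, message) pair.

    warn and crit are offsets in seconds; the defaults are placeholders.
    """
    match = OFFSET_RE.search(tracking_text)
    if match is None:
        return UNKNOWN, "UNKNOWN - no 'System time' line found"
    offset = float(match.group(1))
    detail = "clock is %.6f s %s of NTP time" % (offset, match.group(2))
    if offset >= crit:
        return CRITICAL, "CRITICAL - " + detail
    if offset >= warn:
        return WARNING, "WARNING - " + detail
    return OK, "OK - " + detail
```

The text passed in would come from running "chronyc tracking" on the node, for 
example via subprocess.run in the same way as any other plugin wrapper.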

--
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'
