Hi Loris,

On 7/23/21 9:05 AM, Loris Bennett wrote:
> We use both Zabbix and pestat.  Zabbix gives us general information on
> the state of the nodes and file systems, and we have added some Slurm
> metrics, such as number of jobs pending, amount of memory pending,
> number of GPUs pending, etc.  This has been quite handy, although I find
> Zabbix a bit tricky to configure.  This may be because (a) we are stuck
> on Version 3.4 due to the PHP dependency with CentOS 7 and (b) I only do
> stuff very irregularly with Zabbix and so always have to start somewhat
> from scratch.

I prefer simple tools, if possible :-) For monitoring Slurm compute nodes, I'm fully satisfied with the LBNL Node Health Check tools. They offer checks of disk space, memory, GPUs, InfiniBand and much more. See
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-health-check

For monitoring the Slurm queue and pending jobs, I use the "showuserjobs" script from https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserjobs
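
To illustrate the kind of per-user pending-job count discussed here, a minimal Python sketch follows. It is not the showuserjobs code and not anything from the Zabbix setup described above; the function names are hypothetical, and it assumes squeue is on the PATH and that one output line per pending job is good enough (a pending job array may show up as a single line).

```python
import subprocess
from collections import Counter


def count_pending_by_user(squeue_output: str) -> Counter:
    """Count jobs per user from text holding one user name per line."""
    users = (line.strip() for line in squeue_output.splitlines())
    return Counter(user for user in users if user)


def pending_jobs_from_slurm() -> Counter:
    """Ask squeue for pending jobs and count them per user.

    Assumption: squeue is installed and on the PATH.
    -h drops the header, -t PENDING selects pending jobs,
    -o %u prints only the user name of each job.
    """
    result = subprocess.run(
        ["squeue", "-h", "-t", "PENDING", "-o", "%u"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_pending_by_user(result.stdout)


# Example use on a Slurm login node (hypothetical):
#     for user, njobs in pending_jobs_from_slurm().most_common():
#         print(f"{user:<12} {njobs}")
```

Keeping the parsing separate from the squeue call means the counting can be tried out on canned text without a running Slurm cluster.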

> pestat, on the other hand, gives us more information about what individual
> jobs on individual nodes are up to at a given point in time.  I don't
> quite see how one could integrate pestat itself directly into Zabbix, as
> it is more geared to producing a report, but maybe Ole has ideas :-)

Sorry, no ideas because I'm not familiar with Zabbix.

/Ole
