On 21/07/2021 20:27, Ole Holm Nielsen wrote:

Hi Ole.

What should I think?
Did you distribute the new slurm.conf to all compute nodes after the change?
/etc/slurm/slurm.conf is a symlink to /home/conf/slurm.conf, and /home is NFS-mounted on every node. No need to re-distribute it :)
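
For anyone wanting to double-check a setup like this on a given node, here is a minimal sketch that prints where the config symlink ends up. The helper name resolve_conf is made up for illustration, and it assumes a readlink that supports -f (GNU coreutils does):

```shell
# Sketch: report where a config symlink resolves to.
# resolve_conf is a hypothetical helper, not part of Slurm.
resolve_conf() {
    # readlink -f follows every symlink in the path and prints the final target
    readlink -f "$1"
}

conf=/etc/slurm/slurm.conf   # path as described above; adjust for your site
if [ -e "$conf" ]; then
    echo "$conf -> $(resolve_conf "$conf")"
else
    echo "$conf not found on this host"
fi
```

If the printed target is not under the NFS mount on some node, that node is probably reading a stale local copy.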

  Did you do "scontrol reconfig" for the slurmd daemons to pick up the changes?
Given the type of changes, I opted for "systemctl restart slurmctld" (and a restart of slurmd on the worker nodes). cssh and bash-completion make it quite fast :)
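
Roughly, the restart sequence looks like the sketch below. It is a dry run by default and only prints the commands; the node names and the run/restart_cluster helpers are placeholders invented for this example, and a plain ssh loop stands in for the interactive cssh session:

```shell
# Dry-run sketch: prints each command instead of executing it.
# Set DRY_RUN=0 to really run them (needs root and working ssh).
DRY_RUN=${DRY_RUN:-1}
NODES="node01 node02"   # hypothetical worker node names

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

restart_cluster() {
    # Controller first, then slurmd on every worker node
    run systemctl restart slurmctld
    for node in $NODES; do
        run ssh "$node" systemctl restart slurmd
    done
}

restart_cluster
```

Restarting the controller before the workers is just the order used here, not a documented requirement.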

  This is standard procedure when making any changes to slurm.conf, read about "reconfigure" in the scontrol man-page.
Yup.

The Configless Slurm (https://slurm.schedmd.com/configless_slurm.html) from 20.02 makes distribution of slurm.conf really simple.
Eager to see it in Debian :)

For monitoring the state of compute nodes and their jobs, I recommend "pestat" from https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat I use "pestat -F" many times every day to see if any jobs are misbehaving.
I'll have a look. I'm also setting up Zabbix for more general monitoring, but I'm not really comfortable with it yet (for example, I still can't work out how to exclude some metrics from a host that got them from a template; when I have enough time I'll find a way :) ). Maybe pestat can be added to the Zabbix metrics...

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
