Hi Diego,
On 21-07-2021 11:56, Diego Zuccato wrote:
> I suspended testing config changes to update another machine. In the
> last test I added "CPUs=192" to the node definition, restarted slurmctld
> and nothing changed.
> When I returned, I checked again and slurm reported 192 CPUs! Magic?
> I now removed CPUs=192, restarted slurmctld and it keeps seeing all CPUs...
> What should I think?
Did you distribute the new slurm.conf to all compute nodes after the
change? Did you do "scontrol reconfig" so that the slurmd daemons pick
up the changes? This is standard procedure when making any changes to
slurm.conf; read about "reconfigure" in the scontrol man-page.
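
As a rough sketch of that procedure, something like the following could be
used. The node names, the use of scp as transport and the /etc/slurm/slurm.conf
path are assumptions for illustration, not taken from your setup, and the
script only prints the commands unless dry_run is switched off:

```python
import subprocess
from typing import List

# Assumed location of slurm.conf; adjust to your installation.
SLURM_CONF = "/etc/slurm/slurm.conf"


def build_commands(nodes: List[str], conf_path: str = SLURM_CONF) -> List[List[str]]:
    """Return the commands that copy slurm.conf to each node, then reconfigure."""
    # One copy per compute node (scp is an assumption; any file
    # distribution tool would do).
    commands = [["scp", conf_path, f"{node}:{conf_path}"] for node in nodes]
    # Run once, after all copies: asks the Slurm daemons to re-read slurm.conf.
    commands.append(["scontrol", "reconfigure"])
    return commands


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command and execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical node names; the default is a dry run that only prints.
    run(build_commands(["node01", "node02"]))
```

The point of the ordering is that every node has the new file before the
reconfigure is issued, so all daemons read the same slurm.conf.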
Configless Slurm (https://slurm.schedmd.com/configless_slurm.html),
available since 20.02, makes distribution of slurm.conf really simple.
> But another problem surfaces: slurmtop seems not to handle so many CPUs
> gracefully and throws a lot of errors, but that should be something
> manageable...
For monitoring the state of compute nodes and their jobs, I recommend
"pestat" from
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
I use "pestat -F" many times every day to see if any jobs are misbehaving.
/Ole