Hi Diego,

On 21-07-2021 11:56, Diego Zuccato wrote:
I suspended testing config changes to update another machine. In the last test I added "CPUs=192" to the node definition, restarted slurmctld and nothing changed.
When I returned, I checked again and slurm reported 192 CPUs! Magic?
I now removed CPUs=192, restarted slurmctld and it keeps seeing all CPUs...
What should I think?

Did you distribute the new slurm.conf to all compute nodes after the change? Did you run "scontrol reconfig" so that the slurmd daemons pick up the changes? This is standard procedure after any change to slurm.conf; read about "reconfigure" in the scontrol man-page.
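
As a rough illustration of that procedure, here is a minimal dry-run sketch in Python. It only builds the commands and does not execute anything. The node names, the /etc/slurm/slurm.conf path and the use of scp are assumptions for the example, not something taken from your setup; adapt them to however you normally push files to your nodes.

```python
import shlex


def plan_conf_rollout(nodes, conf_path="/etc/slurm/slurm.conf"):
    """Return the commands (as argv lists) for a slurm.conf rollout.

    Nothing is executed here; the caller decides whether to run them.
    conf_path is an assumed location, adjust it to your installation.
    """
    commands = []
    for node in nodes:
        # Copy the controller's slurm.conf to the same path on each node.
        commands.append(["scp", conf_path, f"{node}:{conf_path}"])
    # Ask the daemons to re-read the configuration once all copies are in place.
    commands.append(["scontrol", "reconfigure"])
    return commands


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    for argv in plan_conf_rollout(["node01", "node02"]):
        print(shlex.join(argv))
```

Printing the plan first makes it easy to review what would be copied where before running anything by hand or passing each argv list to subprocess.run().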

Configless Slurm (https://slurm.schedmd.com/configless_slurm.html), available since 20.02, makes distribution of slurm.conf really simple.

But another problem surfaces: slurmtop seems not to handle so many CPUs gracefully and throws a lot of errors, but that should be something manageable...

For monitoring the state of compute nodes and their jobs, I recommend "pestat" from https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat

I use "pestat -F" many times every day to see if any jobs are misbehaving.

/Ole
