Hi Diego,
On 7/23/21 8:16 AM, Diego Zuccato wrote:
>> The Configless Slurm (https://slurm.schedmd.com/configless_slurm.html)
>> from 20.02 makes distribution of slurm.conf really simple.
> Eager to see it in Debian :)
IMHO, there ought to be a community effort to provide up-to-date Slurm
packages for Debian (and Ubuntu), just like a colleague did for the EPEL
repository for RHEL and derivatives ;-) We run CentOS and can trivially
build new RPMs from the Slurm source tarballs.
>> For monitoring the state of compute nodes and their jobs, I recommend
>> "pestat" from
>> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
>> I use "pestat -F" many times every day to see if any jobs are
>> misbehaving.
> I'll have a look. I'm also setting up Zabbix for more general monitoring,
> but I'm not really OK with it yet (for example, I still can't understand
> how I can exclude some metrics from a host that got 'em added by a
> template... When I have enough time I'll find a way :) ). Maybe pestat
> can be added to the Zabbix metrics...
Did you check out what pestat can do (and maybe not do) for you? If you
have any suggestions for improving pestat, I'd be glad to see what I can do.
/Ole