At HMS we do the same as Paul's cluster and specify the groups we want to have
access to all our compute nodes. We allow two groups, representing our DevOps
team and our Research Computing consultants, to have access, and then add
corresponding sudo rules for each group to allow different command se
Hi Paul,
There could be multiple reasons why the job isn't running, from the user's QOS
to your cluster hitting MaxJobCount. This page might help:
https://slurm.schedmd.com/high_throughput.html
The output of the following command would also be useful:
scontrol show job 465072
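
If you want to pull out just the fields that usually explain a pending job, something like the sketch below could work. It assumes scontrol is on the PATH and that the fields of interest (JobState, Reason) are plain Key=Value tokens with no spaces in the value. The helper names parse_scontrol_fields and job_state_and_reason are made up for illustration and are not part of Slurm.

```python
import subprocess


def parse_scontrol_fields(text):
    """Split `scontrol show job` output into a dict of Key=Value fields.

    Assumption: values contain no whitespace. Fields whose values do
    contain spaces will be cut off at the first space.
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        # Keep the first occurrence of each key; skip tokens without "=".
        if sep and key not in fields:
            fields[key] = value
    return fields


def job_state_and_reason(job_id):
    """Run `scontrol show job <job_id>` and return (JobState, Reason)."""
    result = subprocess.run(
        ["scontrol", "show", "job", str(job_id)],
        capture_output=True,
        text=True,
        check=True,  # raise if scontrol exits non-zero (e.g. unknown job ID)
    )
    fields = parse_scontrol_fields(result.stdout)
    return fields.get("JobState"), fields.get("Reason")
```

Calling job_state_and_reason(465072) on a login node would return the state and the scheduler's reason for the job, which is typically the quickest hint as to whether a QOS limit or something else is holding it.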
Regards
--
Mick Timony
Se
We set SlurmdTimeout=600. The docs say not to go any higher than 65533 seconds:
https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout
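
For anyone who wants to sanity-check their own setting, here is a rough sketch that reads the SlurmdTimeout value out of slurm.conf text and compares it with the 65533-second ceiling from the docs. The function names and the simple line-based parsing are my own assumptions for illustration; it does not follow Include directives or any other slurm.conf features.

```python
import re

# Upper bound quoted in the slurm.conf documentation for SlurmdTimeout.
MAX_SLURMD_TIMEOUT = 65533


def slurmd_timeout(conf_text):
    """Return the SlurmdTimeout value found in slurm.conf text, or None if unset.

    Assumption: a plain `SlurmdTimeout=<seconds>` line; the last one wins.
    """
    value = None
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # ignore anything after a comment marker
        match = re.search(r"\bSlurmdTimeout\s*=\s*(\d+)", line, re.IGNORECASE)
        if match:
            value = int(match.group(1))
    return value


def timeout_is_valid(seconds):
    """True if the value is within the documented 65533-second limit."""
    return 0 <= seconds <= MAX_SLURMD_TIMEOUT
```

With our setting, slurmd_timeout("SlurmdTimeout=600") gives 600, and timeout_is_valid(600) is True.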
The FAQ also has info about SlurmdTimeout. The worst thing that could happen is
that it will take longer for nodes to be set as down:
>A node is set DOWN when the s