We set SlurmdTimeout=600. The docs say not to go any higher than 65533 seconds:
https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout
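For reference, that's just a single line in slurm.conf, and you can confirm what the controller is actually using with scontrol (the path below assumes a standard install):

# /etc/slurm/slurm.conf (path may differ on your install)
SlurmdTimeout=600

# confirm the value the running controller is using:
scontrol show config | grep SlurmdTimeout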
The FAQ also has info about SlurmdTimeout. The worst that could happen is that it
will take longer for nodes to be marked DOWN:
>A node is set DOWN when the s
Hi Paul,
There could be multiple reasons why the job isn't running, from the user's QOS
to your cluster hitting MaxJobCount. This page might help:
https://slurm.schedmd.com/high_throughput.html
The output of the following command might help:
scontrol show job 465072
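If you just want the pending reason rather than the full job record, squeue can show it as well (the job ID here is the one from this thread):

squeue -j 465072 -o "%.18i %.9T %.10r"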
Regards
--
Mick Timony
Se
At HMS we do the same as Paul's cluster and specify the groups we want to have
access to all our compute nodes. We allow two groups, one representing our DevOps
team and one our Research Computing consultants, to have access, and then have
corresponding sudo rules for each group to allow different command sets.
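As a rough sketch of the sudoers side (the group names and command lists below are placeholders, not our actual rules):

# /etc/sudoers.d/slurm-admins -- placeholder groups and commands
Cmnd_Alias SLURM_FULL    = /usr/bin/scontrol, /usr/bin/sacctmgr, /usr/bin/systemctl restart slurmd
Cmnd_Alias SLURM_LIMITED = /usr/bin/scontrol update NodeName=* State=*

%devops      ALL=(root) NOPASSWD: SLURM_FULL
%rc-consult  ALL=(root) NOPASSWD: SLURM_LIMITED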
We do something very similar at HMS. For instance, for our nodes with 257468MB of
RAM we round RealMemory down to 257000MB; for nodes with 1031057MB of RAM we
round down to 100 etc.
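In slurm.conf terms that looks roughly like the following (node names are placeholders, and the rounded value for the large-memory nodes is illustrative since the exact figure was cut off above):

# slurm.conf -- RealMemory set below what the OS reports, values in MB
NodeName=cn[001-064] RealMemory=257000    # hardware reports 257468 MB
NodeName=hm[01-08]   RealMemory=1030000   # illustrative rounding for the 1031057 MB nodes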
We may tune this on our next OS and Slurm update, as I expect to see more memory
used by the OS as we migrate to