[slurm-users] Re: Increasing SlurmdTimeout beyond 300 Seconds

2024-02-12 Thread Timony, Mick via slurm-users
We set SlurmdTimeout=600. The docs say not to go any higher than 65533 seconds: https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout The FAQ also has info about SlurmdTimeout. The worst thing that could happen is that it will take longer to set nodes as being down: >A node is set DOWN when the s
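A minimal slurm.conf sketch of the setting discussed above; the 600-second value is the one quoted in the thread, and the surrounding comments summarize the documented behavior:

```ini
# slurm.conf fragment (illustrative)
# SlurmdTimeout: seconds slurmctld waits for a response from a node's
# slurmd before marking the node DOWN. The thread uses 600; the
# documented maximum is 65533 seconds.
SlurmdTimeout=600
```

Changing this value requires reconfiguring the controller (e.g. `scontrol reconfigure`) for it to take effect cluster-wide.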

[slurm-users] Re: Job submitted to multiple partitions not running when any partition is full

2024-07-09 Thread Timony, Mick via slurm-users
Hi Paul, There could be multiple reasons why the job isn't running, from the user's QOS to your cluster hitting MaxJobCount. This page might help: https://slurm.schedmd.com/high_throughput.html The output of the following command should also be useful: scontrol show job 465072 Regards -- Mick Timony Se
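The diagnostic commands suggested above can be run on any node with Slurm client tools; the job ID is the one from the thread, and the `squeue` format string is one possible way to surface the pending reason:

```
# Full job record, including the "Reason" a pending job is not running
scontrol show job 465072

# One-line summary of the same information per job
squeue -j 465072 --Format=jobid,partition,state,reasonlist
```

For jobs submitted to multiple partitions, `scontrol show job` lists all requested partitions, and the pending reason applies to the partition the scheduler last evaluated.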

[slurm-users] Re: Temporarily bypassing pam_slurm_adopt.so

2024-07-09 Thread Timony, Mick via slurm-users
At HMS we do the same as Paul's cluster and specify the groups we want to have access to all our compute nodes: we allow two groups representing our DevOps team and our Research Computing consultants, with corresponding sudo rules for each group to allow different command se
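A common way to implement the group-based bypass described above is to list pam_access.so before pam_slurm_adopt.so in the PAM account stack, so whitelisted groups skip the adopt check while everyone else still goes through it. A sketch, with hypothetical group names (`devops`, `rc-consult`) standing in for the site's real groups:

```
# /etc/pam.d/sshd (account section) -- order matters:
# pam_access grants the whitelisted groups and is "sufficient",
# so pam_slurm_adopt only runs for everyone else.
account    sufficient   pam_access.so
account    required     pam_slurm_adopt.so

# /etc/security/access.conf
# root and the two admin groups may log in anywhere; all other
# users fall through to pam_slurm_adopt (denial here is ignored
# because pam_access is marked "sufficient").
+ : root (devops) (rc-consult) : ALL
- : ALL : ALL
```

The sudo rules mentioned in the thread are separate from this PAM stack and would live in sudoers files per group.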

[slurm-users] Re: Do I have to hold back RAM for worker nodes?

2025-05-12 Thread Timony, Mick via slurm-users
We do something very similar at HMS. For instance, on nodes with 257468MB of RAM we round RealMemory down to 257000MB; for nodes with 1031057MB of RAM we round down to 100 etc. We may tune this on our next OS and Slurm update, as I expect to see more memory used by the OS as we migrate to