Hello,

We have just finished an upgrade to Slurm 18.08. My last task was to reset the
slurmctld/slurmd timeouts to sensible values -- i.e. the values they had been
set to prior to the upgrade. That is:


SlurmctldTimeout        = 60 sec
SlurmdTimeout           = 300 sec


With Slurm < 18.08 I had reconfigured the cluster many times before without
any issues. Yesterday I found that this command "pushed" most of the compute
nodes into a "NODE_FAIL" state, resulting in the loss of most running jobs.


I'm wondering if anyone has seen anything like this on their cluster and, if
so, what the solution was. I would be interested in hearing your experiences.
Maybe I need to revise/increase the timeout values -- this sort of issue is
tricky to test on an active cluster.
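
For anyone who wants to compare, the timeout values the daemons are actually
running with, and the reasons slurmctld recorded for the failed nodes, can be
checked with something like:

scontrol show config | grep -i timeout
sinfo -R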


Best regards,

David
