Dear Slurm Community,

We understand that SlurmdTimeout defaults to 300 seconds, and that if the
controller cannot communicate with a node for that long, it marks the node
DOWN. We have two questions about this:

1. Do individual compute nodes also kill their own jobs if they cannot
reach the controller for some number of minutes? If so, is that controlled
by the same SlurmdTimeout, or by a different timeout parameter?

2. Are there any major scheduling or performance implications to increasing
these timeouts, aside from the obvious risk of scheduling a job on a node
that is actually down?

Thanks so much,
__________________________________________________
*Jacob D. Chappell, CSM*
Research Computing | Research Computing Infrastructure
Information Technology Services | University of Kentucky
jacob.chapp...@uky.edu
