[slurm-users] SlurmdTimeout and keeping jobs running

Jacob Chappell Fri, 07 Aug 2020 12:12:22 -0700

Dear Slurm Community,

We understand that SlurmdTimeout defaults to 300 seconds, and that if the
controller cannot communicate with a node for that long, it marks the node
down. We have two questions about this:


1. Will individual compute nodes also kill their own jobs if they cannot
communicate with the controller for some number of minutes? If so, is that
controlled by the same SlurmdTimeout, or by a different timeout parameter?

2. Are there any major scheduling or performance implications to increasing
SlurmdTimeout (and any related node-side timeout), aside from the obvious
potential to schedule a job on a node that is down?
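
For concreteness, here is a minimal sketch of the kind of change we are
asking about, assuming the timeout is set in slurm.conf. The value 600 is
purely hypothetical and is not our actual configuration:

```
# slurm.conf (fragment)
# Hypothetical value, for illustration only; the default is 300 seconds.
SlurmdTimeout=600
```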

Thanks so much,
__________________________________________________
*Jacob D. Chappell, CSM*
Research Computing | Research Computing Infrastructure
Information Technology Services | University of Kentucky
jacob.chapp...@uky.edu
