Dear Slurm Community,

We recognize that SlurmdTimeout has a default value of 300 seconds, and that if the controller is unable to communicate with a node during that time, it will mark the node down. We have two questions regarding this:

1. Won't individual compute nodes also kill their own jobs if they are unable to communicate with the controller for some number of minutes? If so, is that controlled by the same SlurmdTimeout, or by a different timeout parameter?

2. Are there any major scheduling or performance implications to increasing these values, aside from the obvious potential to schedule a job on a node that is down?

Thanks so much,

Jacob D. Chappell, CSM
Research Computing | Research Computing Infrastructure
Information Technology Services | University of Kentucky
jacob.chapp...@uky.edu