I've seen that this was reported as a bug and fixed (https://bugs.schedmd.com/show_bug.cgi?id=3941), but it still happens occasionally: a user cancels their job and the node gets drained. UnkillableStepTimeout=120 is set in slurm.conf.
Slurm 20.02.3 on CentOS 7.9, running on Bright Cluster 8.2.

Relevant slurmctld log entries:

    Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED, ExitCode 0
    Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
    update_node: node node001 reason set to: Kill task failed
    update_node: node node001 state set to DRAINING
    error: slurmd error running JobId=6908 on node(s)=node001: Kill task failed
    update_node: node node001 reason set to: hung
    update_node: node node001 state set to DOWN
    update_node: node node001 state set to IDLE
    error: Nodes node001 not responding

Current configuration:

    scontrol show config | grep kill
    UnkillableStepProgram   = (null)
    UnkillableStepTimeout   = 120 sec

Do we just increase the timeout value?
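For reference, a minimal sketch of the two slurm.conf knobs involved here (the timeout value and script path are illustrative assumptions, not recommendations):

    # Give slurmd longer to reap processes stuck in uninterruptible
    # sleep (e.g. on a hung NFS mount) before declaring the step
    # unkillable and draining the node.
    UnkillableStepTimeout=300

    # Optionally run a script when the timeout is hit, e.g. to capture
    # process state for debugging. Path below is hypothetical.
    UnkillableStepProgram=/usr/local/sbin/debug_unkillable.sh

Since slurmd on the compute nodes acts on these values, a change would typically be followed by an scontrol reconfigure or a slurmd restart. Note that raising the timeout only helps if the processes eventually die; if they are permanently stuck in D state on a hung filesystem, no timeout will be long enough.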