I've seen that this was reported as a bug and fixed (https://bugs.schedmd.com/show_bug.cgi?id=3941), but it still happens occasionally: a user cancels their job and the node gets drained. UnkillableStepTimeout=120 is set in slurm.conf.
Slurm 20.02.3 on CentOS 7.9, running on Bright Cluster 8.2.

Relevant slurmctld log entries:

    Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED, ExitCode 0
    Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
    update_node: node node001 reason set to: Kill task failed
    update_node: node node001 state set to DRAINING
    error: slurmd error running JobId=6908 on node(s)=node001: Kill task failed
    update_node: node node001 reason set to: hung
    update_node: node node001 state set to DOWN
    update_node: node node001 state set to IDLE
    error: Nodes node001 not responding

Current configuration:

    scontrol show config | grep kill
    UnkillableStepProgram   = (null)
    UnkillableStepTimeout   = 120 sec

Do we just increase the timeout value?
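For reference, a minimal sketch of the two slurm.conf knobs involved here (the timeout value and script path are illustrative assumptions, not recommendations):

    # Give slurmd longer to reap processes stuck in uninterruptible
    # sleep (e.g. on a hung NFS mount) before declaring the step
    # unkillable and draining the node.
    UnkillableStepTimeout=300

    # Optionally run a script when the timeout is hit, e.g. to capture
    # process state for debugging. Path below is hypothetical.
    UnkillableStepProgram=/usr/local/sbin/debug_unkillable.sh

Since slurmd on the compute nodes acts on these values, a change would typically be followed by an scontrol reconfigure or a slurmd restart. Note that raising the timeout only helps if the processes eventually die; if they are permanently stuck in D state on a hung filesystem, no timeout will be long enough.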