That can help. Usually this happens because the storage the job is
using is laggy and takes a while to flush the job's data, so making
sure that your storage is up, responsive, and stable will also cut
these down.
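If you want to confirm that it really is storage, one option is to
point UnkillableStepProgram (currently (null) in your config) at a
small diagnostic script. Something along these lines is only a
sketch; the log path and script name are placeholders:

    #!/bin/bash
    # Hypothetical UnkillableStepProgram: record what is stuck when
    # Slurm gives up killing a step. Processes in D state
    # (uninterruptible sleep) are usually blocked on storage/NFS I/O.
    LOG=/var/log/slurm/unkillable-$(hostname).log
    {
        echo "=== $(date) unkillable step reported on $(hostname) ==="
        # Keep the header row plus any process whose STAT contains D
        ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /D/'
    } >> "$LOG" 2>&1

That at least tells you which filesystem or mount to chase the next
time a node gets drained for this.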
-Paul Edmon-
On 11/30/2020 12:52 PM, Robert Kudyba wrote:
I've seen that this was a bug that was fixed
(https://bugs.schedmd.com/show_bug.cgi?id=3941), but it still happens
occasionally. A user cancels their job and a node gets drained.
UnkillableStepTimeout=120 is set in slurm.conf.
Slurm 20.02.3 on CentOS 7.9, running on Bright Cluster 8.2.
Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED,
ExitCode 0
Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
update_node: node node001 reason set to: Kill task failed
update_node: node node001 state set to DRAINING
error: slurmd error running JobId=6908 on node(s)=node001: Kill task
failed
update_node: node node001 reason set to: hung
update_node: node node001 state set to DOWN
update_node: node node001 state set to IDLE
error: Nodes node001 not responding
scontrol show config | grep kill
UnkillableStepProgram = (null)
UnkillableStepTimeout = 120 sec
Do we just increase the timeout value?
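If it is just a matter of bumping it, is something like this in
slurm.conf what's recommended? (The 300 is a guess on my part, and
the script path is only a placeholder for whatever we'd put there.)

    # Proposed slurm.conf change (values illustrative, not tested)
    UnkillableStepTimeout=300
    UnkillableStepProgram=/usr/local/sbin/unkillable_report.sh

followed by restarting the daemons (or an scontrol reconfigure, if
that's enough to pick it up)?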