That can help.  Usually this happens when the storage the job is using is laggy and takes a while to flush the job's data, so making sure that your storage is up, responsive, and stable will also cut these down.
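
If you do bump the timeout, a minimal slurm.conf sketch might look like the lines below. The 300-second value is just an illustration (pick whatever matches how slow your storage can get), and the UnkillableStepProgram path is hypothetical; both parameters are the same ones shown in your scontrol output.

    # slurm.conf (illustrative values only)
    # Give slurmd longer to confirm the job's processes are gone before
    # declaring "Kill task failed" and draining the node.
    UnkillableStepTimeout=300

    # Optionally run a site script when the timeout is hit, e.g. to log
    # which processes are stuck in D state (path below is hypothetical).
    #UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh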

-Paul Edmon-

On 11/30/2020 12:52 PM, Robert Kudyba wrote:
I've seen where this was a bug that was fixed https://bugs.schedmd.com/show_bug.cgi?id=3941 <https://bugs.schedmd.com/show_bug.cgi?id=3941> but this happens occasionally still. A user cancels his/her job and a node gets drained. UnkillableStepTimeout=120 is set in slurm.conf

Slurm 20.02.3 on CentOS 7.9 running on Bright Cluster 8.2

Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED, ExitCode 0
Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
update_node: node node001 reason set to: Kill task failed
update_node: node node001 state set to DRAINING
error: slurmd error running JobId=6908 on node(s)=node001: Kill task failed

update_node: node node001 reason set to: hung
update_node: node node001 state set to DOWN
update_node: node node001 state set to IDLE
error: Nodes node001 not responding

scontrol show config | grep kill
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 120 sec

Do we just increase the timeout value?
