That can help. Usually this happens because the storage the job is
using is laggy and takes a while to flush the job's data, so making
sure that your storage is up, responsive, and stable will also cut
these down.
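If you want to confirm that it really is storage, one option is to
point UnkillableStepProgram (currently (null) in your config) at a
small diagnostic script. Something along these lines is only a
sketch; the log path and script name are placeholders:

    #!/bin/bash
    # Hypothetical UnkillableStepProgram: record what is stuck when
    # Slurm gives up killing a step. Processes in D state
    # (uninterruptible sleep) are usually blocked on storage/NFS I/O.
    LOG=/var/log/slurm/unkillable-$(hostname).log
    {
        echo "=== $(date) unkillable step reported on $(hostname) ==="
        # Keep the header row plus any process whose STAT contains D
        ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /D/'
    } >> "$LOG" 2>&1

That at least tells you which filesystem or mount to chase the next
time a node gets drained for this.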
-Paul Edmon-
On 11/30/2020 12:52 PM, Robert Kudyba wrote:
I've seen that this was a bug that was fixed
(https://bugs.schedmd.com/show_bug.cgi?id=3941), but it still happens
occasionally. A user cancels their job and a node gets drained.
UnkillableStepTimeout=120 is set in slurm.conf.
Slurm 20.02.3 on CentOS 7.9, running on Bright Cluster 8.2.
Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED,
ExitCode 0
Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
update_node: node node001 reason set to: Kill task failed
update_node: node node001 state set to DRAINING
error: slurmd error running JobId=6908 on node(s)=node001: Kill task
failed
update_node: node node001 reason set to: hung
update_node: node node001 state set to DOWN
update_node: node node001 state set to IDLE
error: Nodes node001 not responding
scontrol show config | grep kill
UnkillableStepProgram = (null)
UnkillableStepTimeout = 120 sec
Do we just increase the timeout value?
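If it is just a matter of bumping it, is something like this in
slurm.conf what's recommended? (The 300 is a guess on my part, and
the script path is only a placeholder for whatever we'd put there.)

    # Proposed slurm.conf change (values illustrative, not tested)
    UnkillableStepTimeout=300
    UnkillableStepProgram=/usr/local/sbin/unkillable_report.sh

followed by restarting the daemons (or an scontrol reconfigure, if
that's enough to pick it up)?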