Same here. Whenever we see rashes of "Kill task failed" it is invariably
symptomatic of one of our Lustre filesystems acting up or being saturated.
-Paul Edmon-
On 7/22/2020 3:21 PM, Ryan Cox wrote:
Angelos,
I'm glad you mentioned UnkillableStepProgram. We meant to look at
that a while ago but forgot about it. That will be very useful for us
as well, though the answer for us is pretty much always Lustre problems.
Ryan
On 7/22/20 1:02 PM, Angelos Ching wrote:
Agreed. You may also want to write a script that gathers the list of
programs in "D state" (uninterruptible kernel wait) and prints their
stacks, and configure it as UnkillableStepProgram, so that you can
capture the program and the relevant system calls that caused the job
to become unkillable / time out on exit, for further troubleshooting.
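As a starting point, a minimal sketch of such a script (untested here,
and the output path is only a placeholder) could look like this:

#!/bin/bash
# Sketch of an UnkillableStepProgram hook: record every process stuck in
# D state (uninterruptible sleep) along with its kernel stack, so the
# blocking system call can be inspected later. Output path is an example.
OUT=/var/log/slurm/unkillable-$(date +%Y%m%dT%H%M%S).log
{
    echo "=== $(hostname) $(date) ==="
    for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
        echo "--- PID $pid ($(cat /proc/$pid/comm 2>/dev/null)) ---"
        cat /proc/$pid/stack 2>/dev/null
    done
} >> "$OUT"

It would then be referenced from slurm.conf with
UnkillableStepProgram=/path/to/that/script.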
Regards,
Angelos
(Sent from mobile, please pardon me for typos and cursoriness.)
On 2020/07/23 at 0:41, Ryan Cox <ryan_...@byu.edu> wrote:
Ivan,
Are you having I/O slowness? That is the most common cause for us.
If it's not that, you'll want to look through all the reasons that
it takes a long time for a process to actually die after a SIGKILL
because one of those is the likely cause. Typically it's because the
process is waiting for an I/O syscall to return. Sometimes swap
death is the culprit, but usually not at the scale that you stated.
Maybe you could try reproducing the issue manually, or putting
something in the epilog to see the state of the processes in the job's
cgroup.
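As a rough illustration (the cgroup path below assumes a cgroup v1
layout and is only an example, so adjust it for your site), an epilog
fragment could log whatever is still alive in the job's cgroup:

#!/bin/bash
# Rough epilog sketch: list any processes still present in the job's
# freezer cgroup, with their state and the kernel function they are
# waiting in. The cgroup path and the log file are placeholders.
CG=/sys/fs/cgroup/freezer/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}
if [ -r "$CG/cgroup.procs" ]; then
    while read -r pid; do
        ps -o pid=,stat=,wchan=,cmd= -p "$pid" 2>/dev/null
    done < "$CG/cgroup.procs" >> /var/log/slurm/epilog-leftovers.log
fi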
Ryan
On 7/22/20 10:24 AM, Ivan Kovanda wrote:
Dear slurm community,
Currently running slurm version 18.08.4
We have been experiencing an issue where any node a Slurm job was
submitted to ends up in "drain" state.
From what I've seen, it appears that there is a problem with how
Slurm cleans up the job after sending SIGKILL.
I've found this Slurm article
(https://slurm.schedmd.com/troubleshoot.html#completing), which
has a section titled "Jobs and nodes are stuck in COMPLETING
state" that recommends increasing "UnkillableStepTimeout"
in slurm.conf, but all that has done is prolong the time it
takes for the job to time out.
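For reference, that change amounts to a single line in slurm.conf (the
value here is only an example of the sort of increase we tried):

UnkillableStepTimeout=120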
The default time for the "UnkillableStepTimeout" is 60 seconds.
After the job completes, it stays in the CG (completing) state for
those 60 seconds, and then the nodes the job was submitted to go into
drain state.
On the headnode running slurmctld, I am seeing this in the log -
/var/log/slurmctld:
--------------------------------------------------------------------------------------------------------------------------------------------
[2020-07-21T22:40:03.000] update_node: node node001 reason set to:
Kill task failed
[2020-07-21T22:40:03.001] update_node: node node001 state set to
DRAINING
On the compute node, I am seeing this in the log - /var/log/slurmd:
--------------------------------------------------------------------------------------------------------------------------------------------
[2020-07-21T22:38:33.110] [1485.batch] done with job
[2020-07-21T22:38:33.110] [1485.extern] Sent signal 18 to
1485.4294967295
[2020-07-21T22:38:33.111] [1485.extern] Sent signal 15 to
1485.4294967295
[2020-07-21T22:39:02.820] [1485.extern] Sent SIGKILL signal to
1485.4294967295
[2020-07-21T22:40:03.000] [1485.extern] error: *** EXTERN STEP FOR
1485 STEPD TERMINATED ON node001 AT 2020-07-21T22:40:02 DUE TO JOB
NOT ENDING WITH SIGNALS ***
I've tried restarting the slurmd daemon on the compute nodes, and
even completely rebooting a few compute nodes (node001, node002).
From what I've seen, we're experiencing this on all nodes in the
cluster.
I've yet to restart the headnode because there are still active
jobs on the system, and I don't want to interrupt those.
Thank you for your time,
Ivan