Re: [slurm-users] Nodes going into drain because of "Kill task failed"

Paul Edmon Tue, 22 Oct 2019 17:51:49 -0700

It can also happen if you have a stalled-out filesystem or stuck processes. I've gotten in the habit of doing a daily patrol to clean them up. Most of the time you can just reopen the node, but sometimes this indicates something is wedged.
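
One way to automate that kind of patrol is to scan /proc for processes in uninterruptible sleep (state D), which is where processes blocked on a hung filesystem usually end up. What follows is only a minimal sketch, not what is actually run here: the function names are made up for illustration, it assumes a Linux /proc layout, and a process can sit in state D briefly during normal I/O, so a single hit is not proof that something is wedged.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST closing parenthesis.
    head, _, tail = stat_text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state


def find_stuck_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the file was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in find_stuck_processes():
        print(pid, comm)
```

Running it a few times a minute apart and looking for the same PIDs helps separate genuinely stuck processes from ones that were just doing I/O. If nothing turns up, the node can usually be returned to service with scontrol update NodeName=<node> State=RESUME.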


-Paul Edmon-


On 10/22/2019 5:22 PM, Riebs, Andy wrote:

  A common reason for seeing this is if a process is dropping core -- the 
kernel will ignore job kill requests until that is complete, so the job isn't 
being killed as quickly as Slurm would like. I typically recommend increasing 
UnkillableStepTimeout from 60 seconds to 120 or 180 seconds to avoid this.
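
The slurm.conf parameter that controls this wait is UnkillableStepTimeout, with a documented default of 60 seconds. Before changing it, it can help to check what a node's slurm.conf currently sets. The snippet below is a rough sketch for doing that from a copy of the file's text: the helper names and the sample values are made up for illustration, and letting the last occurrence win is a simplification of this sketch, not documented Slurm behaviour.

```python
def get_slurm_option(conf_text, name, default=None):
    """Return the value of `name` from slurm.conf-style text, or `default`."""
    value = default
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0]  # drop trailing comments
        for token in line.split():
            key, sep, val = token.partition("=")
            # slurm.conf keywords are case-insensitive
            if sep and key.lower() == name.lower():
                value = val  # assumption: last occurrence wins
    return value


def unkillable_timeout(conf_text):
    """UnkillableStepTimeout in seconds, falling back to the default of 60."""
    return int(get_slurm_option(conf_text, "UnkillableStepTimeout", 60))


# Hypothetical excerpt, not taken from the cluster in this thread.
SAMPLE = """
KillWait=30
UnkillableStepTimeout=120   # raised from the default
"""

if __name__ == "__main__":
    print(unkillable_timeout(SAMPLE))
```

In slurm.conf itself the change is a single line, UnkillableStepTimeout=120 (or 180).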

Andy

-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Will Dennis
Sent: Tuesday, October 22, 2019 4:59 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Nodes going into drain because of "Kill task failed"

Hi all,

I have a number of nodes on one of my 17.11.7 clusters in drain mode on account of 
reason: "Kill task failed".

I see the following in slurmd.log:

[2019-10-17T20:06:43.027] [34443.0] error: *** STEP 34443.0 ON server15 CANCELLED AT 2019-10-17T20:06:43 DUE TO TIME LIMIT ***
[2019-10-17T20:06:43.029] [34443.0] Sent signal 15 to 34443.0
[2019-10-17T20:06:43.029] Job 34443: timeout: sent SIGTERM to 1 active steps
[2019-10-17T20:06:43.031] [34443.0] Sent signal 18 to 34443.0
[2019-10-17T20:06:43.032] [34443.0] Sent signal 15 to 34443.0
[2019-10-17T20:06:43.036] [34443.0] task 0 (8741) exited. Killed by signal 15.
[2019-10-17T20:06:43.036] [34443.0] Step 34443.0 hit memory limit at least once during execution. This may or may not result in some failure.
[2019-10-17T20:07:13.048] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:15.051] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:16.053] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:17.055] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:18.057] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:19.059] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:20.061] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:21.063] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:22.065] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:23.066] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:24.069] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:34.071] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:44.000] [34443.0] error: *** STEP 34443.0 STEPD TERMINATED ON server15 AT 2019-10-17T20:07:43 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2019-10-17T20:07:44.001] [34443.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2019-10-17T20:07:44.004] [34443.0] done with job

From the above, it seems the step time limit was reached and signal 15 
(SIGTERM) was sent to the process. That appears to have succeeded at 
2019-10-17T20:06:43.036, but the series of SIGKILLs sent afterwards suggests 
that it did not.
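
To make the timing easier to see, the gaps between the first SIGTERM, the first SIGKILL and the point where the stepd gave up can be pulled out of the log. This is just a sketch: the function name and the match strings are chosen for illustration, and the embedded lines are copied from the excerpt above with the wrapped lines joined.

```python
import re
from datetime import datetime

LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\]\s+(.*)$")

# Label -> substring that marks the event in a slurmd.log line.
EVENTS = {
    "sigterm": "Sent signal 15",
    "sigkill": "Sent SIGKILL",
    "gave_up": "STEPD TERMINATED",
}


def kill_timeline(log_text):
    """Map each event label to the timestamp of its first occurrence."""
    first = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # continuation of a wrapped line, or not a log line
        ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        for label, needle in EVENTS.items():
            if needle in m.group(2) and label not in first:
                first[label] = ts
    return first


LOG = """\
[2019-10-17T20:06:43.029] [34443.0] Sent signal 15 to 34443.0
[2019-10-17T20:06:43.036] [34443.0] task 0 (8741) exited. Killed by signal 15.
[2019-10-17T20:07:13.048] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:44.000] [34443.0] error: *** STEP 34443.0 STEPD TERMINATED ON server15 AT 2019-10-17T20:07:43 DUE TO JOB NOT ENDING WITH SIGNALS ***
"""

if __name__ == "__main__":
    t = kill_timeline(LOG)
    print("SIGTERM -> SIGKILL:", (t["sigkill"] - t["sigterm"]).total_seconds())
    print("SIGKILL -> gave up:", (t["gave_up"] - t["sigkill"]).total_seconds())
```

On this excerpt both gaps come out at roughly 30 seconds. The first would be consistent with a KillWait of 30 seconds (the documented default), although I have not confirmed that against this cluster's configuration.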

What might be causing this, and how can I prevent it from happening?

Thanks,
Will
