It can also happen if you have a stalled out filesystem or stuck
processes. I've gotten in the habit of doing a daily patrol for them to
clean them up. Most of the time you can just reopen the node, but
sometimes this indicates something is wedged.
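
A patrol of this kind could be sketched roughly as below. This is only an illustration, not the script actually used here: it assumes a Linux /proc filesystem and treats any process in state "D" (uninterruptible sleep) as a candidate. The function names are made up for the example, and a single snapshot can catch processes that are only briefly in that state, so anything it reports still needs a second look.

```python
import os

def parse_stat_line(line):
    """Split the contents of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so use the first '(' and the last ')' as delimiters.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state

def find_uninterruptible(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state 'D'."""
    stuck = []
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the line was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

Calling find_uninterruptible() on a drained node and printing the result gives a quick list of processes worth checking before the node is returned to service.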
-Paul Edmon-
On 10/22/2019 5:22 PM, Riebs, Andy wrote:
A common reason for seeing this is if a process is dropping core -- the
kernel will ignore job kill requests until that is complete, so the job isn't
being killed as quickly as Slurm would like. I typically recommend increasing
the UnkillableTaskWait from 60 seconds to 120 or 180 seconds to avoid this.
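
For reference, that change is a one-line edit in slurm.conf, sketched below with 120 seconds as an example value rather than a tuned recommendation for any particular site. UnkillableTaskWait defaults to 60 seconds, and the daemons have to pick up the new configuration (scontrol reconfigure, or a restart) before it takes effect.

```
# slurm.conf (fragment)
# Seconds to wait for a signaled process to exit before it is
# treated as unkillable (default: 60).
UnkillableTaskWait=120
```

The value currently in effect can be checked with "scontrol show config | grep UnkillableTaskWait".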
Andy
-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Will Dennis
Sent: Tuesday, October 22, 2019 4:59 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Nodes going into drain because of "Kill task failed"
Hi all,
I have a number of nodes on one of my 17.11.7 clusters in drain mode on account of
reason: "Kill task failed"
I see the following in slurmd.log:
[2019-10-17T20:06:43.027] [34443.0] error: *** STEP 34443.0 ON server15 CANCELLED AT 2019-10-17T20:06:43 DUE TO TIME LIMIT ***
[2019-10-17T20:06:43.029] [34443.0] Sent signal 15 to 34443.0
[2019-10-17T20:06:43.029] Job 34443: timeout: sent SIGTERM to 1 active steps
[2019-10-17T20:06:43.031] [34443.0] Sent signal 18 to 34443.0
[2019-10-17T20:06:43.032] [34443.0] Sent signal 15 to 34443.0
[2019-10-17T20:06:43.036] [34443.0] task 0 (8741) exited. Killed by signal 15.
[2019-10-17T20:06:43.036] [34443.0] Step 34443.0 hit memory limit at least once during execution. This may or may not result in some failure.
[2019-10-17T20:07:13.048] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:15.051] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:16.053] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:17.055] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:18.057] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:19.059] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:20.061] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:21.063] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:22.065] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:23.066] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:24.069] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:34.071] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:44.000] [34443.0] error: *** STEP 34443.0 STEPD TERMINATED ON server15 AT 2019-10-17T20:07:43 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2019-10-17T20:07:44.001] [34443.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2019-10-17T20:07:44.004] [34443.0] done with job
From the above, it seems like the step time limit was reached and signal 15
(SIGTERM) was sent to the process. That seems to have succeeded at
2019-10-17T20:06:43.036, but judging from the series of SIGKILLs sent
afterwards, I guess it did not?
What may be the cause of this, and how can I prevent it from happening?
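
To make the timing in a log like this easier to read, the gaps between the signals can be pulled out with a short script. The sketch below is illustrative only: it assumes the slurmd.log line format shown above (a bracketed timestamp at the start of each entry) and matches on the message strings that appear in this excerpt; the function names are invented for the example.

```python
import re
from datetime import datetime

# Matches "[2019-10-17T20:06:43.027] rest of message"
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\]\s+(.*)$")

def parse_events(lines):
    """Return (timestamp, message) pairs for lines that start with a timestamp."""
    events = []
    for line in lines:
        match = STAMP.match(line.strip())
        if match:  # wrapped continuation lines have no timestamp and are skipped
            stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            events.append((stamp, match.group(2)))
    return events

def first_time(events, needle):
    """Timestamp of the first message containing needle, or None."""
    for stamp, message in events:
        if needle in message:
            return stamp
    return None

def gap_seconds(start, end):
    """Seconds between two timestamps, or None if either is missing."""
    if start is None or end is None:
        return None
    return (end - start).total_seconds()

def kill_timeline(lines):
    """Measure SIGTERM -> first SIGKILL -> stepd giving up."""
    events = parse_events(lines)
    sigterm = first_time(events, "Sent signal 15")
    sigkill = first_time(events, "Sent SIGKILL")
    gave_up = first_time(events, "STEPD TERMINATED")
    return {
        "sigterm_to_sigkill": gap_seconds(sigterm, sigkill),
        "sigkill_to_giveup": gap_seconds(sigkill, gave_up),
    }
```

Run against the excerpt above, this reports about 30 seconds from the first SIGTERM to the first SIGKILL, which would be consistent with the default KillWait of 30 seconds, and about 31 more seconds before the "STEPD TERMINATED" entry is logged.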
Thanks,
Will