Hi all,

Several nodes on one of my Slurm 17.11.7 clusters are in the drain state with 
the reason "Kill task failed".

I see the following in slurmd.log:

[2019-10-17T20:06:43.027] [34443.0] error: *** STEP 34443.0 ON server15 
CANCELLED AT 2019-10-17T20:06:43 DUE TO TIME LIMIT ***
[2019-10-17T20:06:43.029] [34443.0] Sent signal 15 to 34443.0
[2019-10-17T20:06:43.029] Job 34443: timeout: sent SIGTERM to 1 active steps
[2019-10-17T20:06:43.031] [34443.0] Sent signal 18 to 34443.0
[2019-10-17T20:06:43.032] [34443.0] Sent signal 15 to 34443.0
[2019-10-17T20:06:43.036] [34443.0] task 0 (8741) exited. Killed by signal 15.
[2019-10-17T20:06:43.036] [34443.0] Step 34443.0 hit memory limit at least once 
during execution. This may or may not result in some failure.
[2019-10-17T20:07:13.048] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:15.051] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:16.053] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:17.055] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:18.057] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:19.059] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:20.061] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:21.063] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:22.065] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:23.066] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:24.069] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:34.071] [34443.0] Sent SIGKILL signal to 34443.0
[2019-10-17T20:07:44.000] [34443.0] error: *** STEP 34443.0 STEPD TERMINATED ON 
server15 AT 2019-10-17T20:07:43 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2019-10-17T20:07:44.001] [34443.0] error: Failed to send MESSAGE_TASK_EXIT: 
Connection refused
[2019-10-17T20:07:44.004] [34443.0] done with job

From the above, it looks like the step hit its time limit and was sent signal 15 
(SIGTERM). Task 0 exited at 2019-10-17T20:06:43.036, so the SIGTERM seems to 
have succeeded, but the series of SIGKILLs sent from 20:07:13 onward suggests 
it did not?
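
In case it helps anyone reading similar logs, here is a minimal Python sketch of how the signal timeline for one step can be pulled out of slurmd.log. It is not part of Slurm: the helper names (signal_timeline, sigkill_window_seconds) and the regular expression are my own assumptions, based only on the line format shown above. Wrapped continuation lines and lines without a [job.step] tag are simply skipped.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpt above:
#   [2019-10-17T20:06:43.029] [34443.0] Sent signal 15 to 34443.0
LINE_RE = re.compile(r"^\[(?P<ts>[\d\-T:.]+)\] \[(?P<step>\d+\.\d+)\] (?P<msg>.*)$")


def signal_timeline(lines, step):
    """Return (timestamp, message) pairs for signal/exit/error lines of one step.

    Hypothetical helper: lines that do not match LINE_RE (wrapped
    continuations, job-level lines without a step tag) are ignored.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("step") != step:
            continue
        msg = m.group("msg")
        if msg.startswith("Sent ") or "exited" in msg or msg.startswith("error:"):
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
            events.append((ts, msg))
    return events


def sigkill_window_seconds(events):
    """Seconds between the first and last SIGKILL in the timeline (0.0 if none)."""
    kills = [ts for ts, msg in events if "SIGKILL" in msg]
    if not kills:
        return 0.0
    return (kills[-1] - kills[0]).total_seconds()


if __name__ == "__main__":
    # A few lines copied from the log above
    sample = [
        "[2019-10-17T20:06:43.029] [34443.0] Sent signal 15 to 34443.0",
        "[2019-10-17T20:06:43.036] [34443.0] task 0 (8741) exited. Killed by signal 15.",
        "[2019-10-17T20:07:13.048] [34443.0] Sent SIGKILL signal to 34443.0",
        "[2019-10-17T20:07:34.071] [34443.0] Sent SIGKILL signal to 34443.0",
    ]
    timeline = signal_timeline(sample, "34443.0")
    for ts, msg in timeline:
        print(ts.isoformat(), msg)
    print("SIGKILL window (s):", sigkill_window_seconds(timeline))
```

This only summarises what the log already says: task 0 exits at 20:06:43, and SIGKILLs are still being sent 30 seconds later. It does not explain what was still running in the step.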

What might cause this, and how can I prevent it from happening?

Thanks,
Will
