Excellent points raised here! Two other things to do when you see "kill task failed":
1. Check "dmesg -T" on the suspect node to look for significant system events, like file system problems, communication problems, etc., around the time that the problem was logged
2. Check /var/log/slurm (or whatever is appropriate on your system) for core files that correspond to the time reported for "kill task failed"

(Rough sketches of both checks, and of the UnkillableStepProgram approach Marcus describes below, are at the bottom of this message.)

Andy

-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Marcus Boden
Sent: Wednesday, October 23, 2019 2:34 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Nodes going into drain because of "Kill task failed"

you can also use the UnkillableStepProgram to debug things:

> UnkillableStepProgram
>     If the processes in a job step are determined to be unkillable for a period of time specified by the UnkillableStepTimeout variable, the program specified by UnkillableStepProgram will be executed. This program can be used to take special actions to clean up the unkillable processes and/or notify computer administrators. The program will be run as SlurmdUser (usually "root") on the compute node. By default no program is run.
>
> UnkillableStepTimeout
>     The length of time, in seconds, that Slurm will wait before deciding that processes in a job step are unkillable (after they have been signaled with SIGKILL) and execute UnkillableStepProgram as described above. The default timeout value is 60 seconds. If exceeded, the compute node will be drained to prevent future jobs from being scheduled on the node.

This allows you to find out what causes the problem at the time it occurs. You could, for example, use lsof to see if there are any files open due to a hanging fs and mail the output to yourself.

Best,
Marcus

On 19-10-22 20:49, Paul Edmon wrote:
> It can also happen if you have a stalled out filesystem or stuck processes. I've gotten in the habit of doing a daily patrol for them to clean them up. Most of the time you can just reopen the node, but sometimes this indicates something is wedged.
>
> -Paul Edmon-
>
> On 10/22/2019 5:22 PM, Riebs, Andy wrote:
> > A common reason for seeing this is if a process is dropping core -- the kernel will ignore job kill requests until that is complete, so the job isn't being killed as quickly as Slurm would like. I typically recommend increasing the UnkillableStepTimeout from 60 seconds to 120 or 180 seconds to avoid this.
> >
> > Andy
> >
> > -----Original Message-----
> > From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Will Dennis
> > Sent: Tuesday, October 22, 2019 4:59 PM
> > To: slurm-users@lists.schedmd.com
> > Subject: [slurm-users] Nodes going into drain because of "Kill task failed"
> >
> > Hi all,
> >
> > I have a number of nodes on one of my 17.11.7 clusters in drain mode on account of reason: "Kill task failed"
> >
> > I see the following in slurmd.log —
> >
> > [2019-10-17T20:06:43.027] [34443.0] error: *** STEP 34443.0 ON server15 CANCELLED AT 2019-10-17T20:06:43 DUE TO TIME LIMIT ***
> > [2019-10-17T20:06:43.029] [34443.0] Sent signal 15 to 34443.0
> > [2019-10-17T20:06:43.029] Job 34443: timeout: sent SIGTERM to 1 active steps
> > [2019-10-17T20:06:43.031] [34443.0] Sent signal 18 to 34443.0
> > [2019-10-17T20:06:43.032] [34443.0] Sent signal 15 to 34443.0
> > [2019-10-17T20:06:43.036] [34443.0] task 0 (8741) exited. Killed by signal 15.
> > [2019-10-17T20:06:43.036] [34443.0] Step 34443.0 hit memory limit at least once during execution. This may or may not result in some failure.
> > [2019-10-17T20:07:13.048] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:15.051] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:16.053] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:17.055] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:18.057] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:19.059] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:20.061] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:21.063] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:22.065] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:23.066] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:24.069] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:34.071] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:44.000] [34443.0] error: *** STEP 34443.0 STEPD TERMINATED ON server15 AT 2019-10-17T20:07:43 DUE TO JOB NOT ENDING WITH SIGNALS ***
> > [2019-10-17T20:07:44.001] [34443.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
> > [2019-10-17T20:07:44.004] [34443.0] done with job
> >
> > From the above, it seems like the step time limit was reached, and signal 15 (SIGTERM) was sent to the process, which seems to have succeeded at 2019-10-17T20:06:43.036, but I guess not, judging by the series of SIGKILLs sent thereafter?
> >
> > What may be the cause of this, and how to prevent this from happening?
> >
> > Thanks,
> > Will

--
Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience, GWDG
E-Mail: mbo...@gwdg.de
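As promised above, here is a rough sketch of the two checks from the top of this message. It is a starting point, not a polished tool: the node name is taken as an argument, the paths (/var/log/slurm, /var/crash), the grep patterns, and the three-hour window are assumptions you would adapt to your own site, and it assumes password-less ssh to the compute node.

#!/bin/bash
# kill-task-check.sh -- quick look at a node that drained with "Kill task failed".
# Usage: kill-task-check.sh <nodename>
set -u
NODE=${1:?usage: kill-task-check.sh <nodename>}

# 1. Significant kernel events (file system trouble, hung tasks, OOM, link flaps)
#    around the time the problem was logged. Adjust the pattern for your site.
ssh "$NODE" 'dmesg -T | grep -Ei "nfs|lustre|hung task|oom|i/o error|link (is )?down" | tail -n 50'

# 2. Core files written recently in the Slurm log directory (or wherever cores land
#    on your systems); -mmin -180 means modified within the last three hours.
ssh "$NODE" 'find /var/log/slurm /var/crash -name "core*" -mmin -180 -ls 2>/dev/null'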
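And here is a minimal sketch of the kind of UnkillableStepProgram Marcus describes: dump the D-state processes and open files, then mail the report to yourself. You would point slurm.conf at it with something like UnkillableStepProgram=/usr/local/sbin/unkillable-debug.sh and, per my earlier note, a raised UnkillableStepTimeout=180. The script name, output location, grep/ps details, and mail address are placeholders of my own, not anything Slurm requires.

#!/bin/bash
# unkillable-debug.sh -- sketch of an UnkillableStepProgram. Runs as SlurmdUser
# (usually root) on the compute node when a step ignores SIGKILL past the timeout.
OUT=/tmp/unkillable-$(hostname -s)-$(date +%Y%m%dT%H%M%S).log

{
  echo "=== uptime ==="
  uptime
  echo "=== processes in uninterruptible sleep (D state) ==="
  ps -eo pid,stat,wchan:32,user,comm,args | awk 'NR==1 || $2 ~ /D/'
  echo "=== open files (lsof can hang on a dead mount, so cap it) ==="
  timeout 30 lsof -n 2>/dev/null | tail -n 200
  echo "=== recent kernel messages ==="
  dmesg -T | tail -n 100
} > "$OUT" 2>&1

# Mail the report to yourself (replace the address); the file also stays on the
# node in case mail delivery is affected by the same hang.
mail -s "Unkillable step on $(hostname -s)" hpc-admins@example.com < "$OUT" || true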