Hi;
Are you sure this is a job task completion issue? When the epilog script
fails, Slurm will set the node to the DRAIN state:
"If the Epilog fails (returns a non-zero exit code), this will result in
the node being set to a DRAIN state"
https://slurm.schedmd.com/prolog_epilog.html
You can test this possibility by adding an "exit 0" line at the end of
the epilog script.
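For example, a minimal sketch of what I mean (the rest of the epilog is
just a placeholder here):

#!/bin/bash
# ... your existing epilog commands ...
# Force a zero exit status so that a failing command above cannot
# cause Slurm to drain the node.
exit 0

If the nodes stop draining after this change, the root cause is likely
an epilog failure rather than an unkillable task.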
Regards;
Ahmet M.
On 23.07.2020 18:34, Ivan Kovanda wrote:
Thanks for the input guys!
We don't even use Lustre filesystems, and it doesn't appear to be I/O.
I ran *iostat* on both the head node and the compute node while the job
was in CG status, and the %iowait value is 0.00 or 0.01:
$ iostat
Linux 3.10.0-957.el7.x86_64 (node002) 07/22/2020 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.01 0.00 0.01 0.00 0.00 99.98
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 0.82 14.09 2.39 1157160 196648
I also tried the following command to see if I could identify any
processes in D state on the compute node, but got no results:
ps aux | awk '$8 ~ /D/ { print $0 }'
This one's got me stumped…
Sorry, I'm not too familiar with the epilog yet; do you have any examples
of how I would use it to log the SIGKILL event?
Thanks again,
Ivan
*From:* slurm-users <slurm-users-boun...@lists.schedmd.com> *On Behalf
Of *Paul Edmon
*Sent:* Thursday, July 23, 2020 7:19 AM
*To:* slurm-users@lists.schedmd.com
*Subject:* Re: [slurm-users] Nodes going into drain because of "Kill
task failed"
Same here. Whenever we see rashes of "Kill task failed" it is
invariably symptomatic of one of our Lustre filesystems acting up or
being saturated.
-Paul Edmon-
On 7/22/2020 3:21 PM, Ryan Cox wrote:
Angelos,
I'm glad you mentioned UnkillableStepProgram. We meant to look at
that a while ago but forgot about it. That will be very useful
for us as well, though the answer for us is pretty much always
Lustre problems.
Ryan
On 7/22/20 1:02 PM, Angelos Ching wrote:
Agreed. You may also want to write a script that gathers the
list of programs in "D state" (kernel wait) and prints their
stacks, and configure it as UnkillableStepProgram so that you
can capture the programs and the relevant system calls that caused
the job to become unkillable / time out while exiting, for further
troubleshooting.
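As an illustration only, a minimal sketch of such a script (the log
location and the use of /proc/<pid>/stack are my assumptions; adapt
them to your site):

#!/bin/bash
# Sketch of an UnkillableStepProgram: record processes stuck in
# uninterruptible sleep (D state) together with their kernel stacks.
LOG=/var/log/slurm/unkillable-${SLURM_JOB_ID:-unknown}.log
{
    date
    echo "Node: $(hostname)  Job: ${SLURM_JOB_ID:-unknown}"
    for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ { print $1 }'); do
        echo "--- PID $pid ($(cat /proc/$pid/comm 2>/dev/null)) ---"
        cat /proc/$pid/stack 2>/dev/null  # needs root; slurmd normally runs as root
    done
} >> "$LOG" 2>&1
exit 0

You would then point slurm.conf at it with something like
UnkillableStepProgram=/usr/local/sbin/unkillable.sh (the path is an
assumption).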
Regards,
Angelos
(Sent from mobile, please pardon me for typos and cursoriness.)
On 2020/07/23 0:41, Ryan Cox <ryan_...@byu.edu> wrote:
Ivan,
Are you having I/O slowness? That is the most common cause
for us. If it's not that, you'll want to look through all
the reasons that it takes a long time for a process to
actually die after a SIGKILL, because one of those is the
likely cause. Typically it's because the process is
waiting for an I/O syscall to return. Sometimes swap death
is the culprit, but usually not at the scale that you
stated. Maybe you could try reproducing the issue
manually or putting something in the epilog to see the state
of the processes in the job's cgroup.
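Purely as a sketch of that last idea, assuming cgroup v1 with the
freezer hierarchy under /sys/fs/cgroup/freezer/slurm (typical for
proctrack/cgroup; the log path is made up):

#!/bin/bash
# Epilog fragment: dump any processes still alive in the job's cgroup.
CG=/sys/fs/cgroup/freezer/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}
if [ -r "$CG/cgroup.procs" ]; then
    while read -r pid; do
        # STAT and WCHAN show whether the process is stuck in D state and where.
        ps -o pid,stat,wchan:32,cmd -p "$pid" >> /var/log/slurm/epilog-leftovers.log
    done < "$CG/cgroup.procs"
fi
exit 0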
Ryan
On 7/22/20 10:24 AM, Ivan Kovanda wrote:
Dear slurm community,
Currently running slurm version 18.08.4
We have been experiencing an issue where every node a
Slurm job was submitted to ends up in the "drain" state.
From what I've seen, it appears that there is a
problem with how Slurm cleans up the job when it sends
SIGKILL.
I've found this slurm article
(https://slurm.schedmd.com/troubleshoot.html#completing
<https://urldefense.com/v3/__https:/slurm.schedmd.com/troubleshoot.html*completing__;Iw!!NCZxaNi9jForCP_SxBKJCA!FOsRehxg6w3PLipsOItVBSjYhPtRzmQnBUQen6C13v85kgef1cZFdtwuP9zG1sgAEQ$>)
, which has a section titled "Jobs and nodes are stuck
in COMPLETING state", where it recommends increasing
the "UnkillableStepTimeout" in the slurm.conf , but
all that has done is prolong the time it takes for the
job to timeout.
The default time for the "UnkillableStepTimeout" is 60
seconds.
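For reference, this is the kind of line we changed in slurm.conf (the
value 120 below is only an illustration, not a recommendation):

UnkillableStepTimeout=120

The file has to be kept identical on all nodes and the daemons told to
re-read it (e.g. with "scontrol reconfigure").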
After the job completes, it stays in the CG
(completing) state for 60 seconds, then the nodes
the job was submitted to go into drain status.
On the headnode running slurmctld, I am seeing this in
the log - /var/log/slurmctld:
--------------------------------------------------------------------------------------------------------------------------------------------
[2020-07-21T22:40:03.000] update_node: node node001 reason set to: Kill task failed
[2020-07-21T22:40:03.001] update_node: node node001 state set to DRAINING
On the compute node, I am seeing this in the log -
/var/log/slurmd
--------------------------------------------------------------------------------------------------------------------------------------------
[2020-07-21T22:38:33.110] [1485.batch] done with job
[2020-07-21T22:38:33.110] [1485.extern] Sent signal 18 to 1485.4294967295
[2020-07-21T22:38:33.111] [1485.extern] Sent signal 15 to 1485.4294967295
[2020-07-21T22:39:02.820] [1485.extern] Sent SIGKILL signal to 1485.4294967295
[2020-07-21T22:40:03.000] [1485.extern] error: *** EXTERN STEP FOR 1485 STEPD TERMINATED ON node001 AT 2020-07-21T22:40:02 DUE TO JOB NOT ENDING WITH SIGNALS ***
I've tried restarting the SLURMD daemon on the compute
nodes, and even completing rebooting a few computes
nodes (node001, node002) .
From what I've seen were experiencing this on all
nodes in the cluster.
I've yet to restart the headnode because there are
still active jobs on the system so I don't want to
interrupt those.
Thank you for your time,
Ivan