Hi,

Are you sure this is a job task completion issue? When the epilog script fails, Slurm will set the node to a DRAIN state:

"If the Epilog fails (returns a non-zero exit code), this will result in the node being set to a DRAIN state"

https://slurm.schedmd.com/prolog_epilog.html

You can test this possibility by adding an "exit 0" line at the end of the epilog script.
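
For instance, the tail of the epilog could be forced to report success; a minimal sketch (the log line is only an illustration of where the addition goes):

#!/bin/bash
# ... existing epilog cleanup commands ...
# Temporary test: note that the epilog finished, then always report
# success so an epilog failure cannot be what drains the node.
echo "$(date) epilog finished for job ${SLURM_JOB_ID}" >> /tmp/epilog-test.log
exit 0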

Regards,

Ahmet M.


On 23.07.2020 18:34, Ivan Kovanda wrote:

Thanks for the input, guys!

We don’t even use Lustre filesystems, and it doesn’t appear to be I/O.

I executed *iostat* on both the head node and the compute node while the job was in CG status, and the %iowait value is 0.00 or 0.01:

$ iostat
Linux 3.10.0-957.el7.x86_64 (node002)   07/22/2020      _x86_64_        (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    0.00    0.01    0.00    0.00   99.98

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.82        14.09         2.39    1157160     196648

I also tried the following command to see if I could identify any processes in D state on the compute node, but got no results:

ps aux | awk '$8 ~ /D/  { print $0 }'

This one’s got me stumped…

Sorry, I’m not too familiar with the epilog yet; do you have any examples of how I would use it to log the SIGKILL event?

Thanks again,
Ivan

*From:* slurm-users <slurm-users-boun...@lists.schedmd.com> *On Behalf Of *Paul Edmon
*Sent:* Thursday, July 23, 2020 7:19 AM
*To:* slurm-users@lists.schedmd.com
*Subject:* Re: [slurm-users] Nodes going into drain because of "Kill task failed"

Same here.  Whenever we see a rash of "Kill task failed" errors, it is invariably symptomatic of one of our Lustre filesystems acting up or being saturated.

-Paul Edmon-

On 7/22/2020 3:21 PM, Ryan Cox wrote:

    Angelos,

    I'm glad you mentioned UnkillableStepProgram.  We meant to look at
    that a while ago but forgot about it.  That will be very useful
    for us as well, though the answer for us is pretty much always
    Lustre problems.

    Ryan

    On 7/22/20 1:02 PM, Angelos Ching wrote:

        Agreed. You may also want to write a script that gathers the
        list of programs in "D state" (kernel wait) and prints their
        stacks, and configure it as UnkillableStepProgram so that you
        can capture the program and the relevant system calls that
        caused the job to become unkillable / time out while exiting,
        for further troubleshooting.
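
        A rough sketch of such a script (the output path is arbitrary,
        and reading /proc/<pid>/stack generally requires root):

        #!/bin/bash
        # UnkillableStepProgram sketch: record processes stuck in
        # uninterruptible sleep (D state) together with their kernel stacks.
        out=/var/log/slurm/unkillable-$(date +%s).log
        ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/' > "$out"
        for pid in $(ps -eo pid,stat | awk '$2 ~ /D/ {print $1}'); do
            echo "=== kernel stack of PID $pid ===" >> "$out"
            cat "/proc/$pid/stack" >> "$out" 2>/dev/null
        done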


        Regards,

        Angelos

        (Sent from mobile, please pardon me for typos and cursoriness.)



            On 2020/07/23 at 0:41, Ryan Cox <ryan_...@byu.edu> wrote:

             Ivan,

            Are you having I/O slowness? That is the most common cause
            for us. If it's not that, you'll want to look through all
            the reasons that it takes a long time for a process to
            actually die after a SIGKILL because one of those is the
            likely cause. Typically it's because the process is
            waiting for an I/O syscall to return. Sometimes swap death
            is the culprit, but usually not at the scale that you
            stated.  Maybe you could try reproducing the issue
            manually or putting something in the epilog to see the state
            of the processes in the job's cgroup.
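
            As a rough sketch of that epilog idea (the freezer cgroup
            path below assumes a common task/cgroup layout and may
            differ on your cluster; the log path is arbitrary):

            #!/bin/bash
            # Epilog sketch: dump the state of anything still alive in the
            # job's cgroup so we can see what refused to die after SIGKILL.
            cg="/sys/fs/cgroup/freezer/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}"
            log="/var/log/slurm/epilog-stuck-${SLURM_JOB_ID}.log"
            if [ -f "$cg/cgroup.procs" ]; then
                while read -r pid; do
                    ps -o pid,stat,wchan:32,args -p "$pid" >> "$log" 2>&1
                done < "$cg/cgroup.procs"
            fi
            exit 0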

            Ryan

            On 7/22/20 10:24 AM, Ivan Kovanda wrote:

                Dear slurm community,

                Currently running slurm version 18.08.4

                We have been experiencing an issue in which the nodes
                a Slurm job was submitted to go into "drain".

                From what I've seen, it appears that there is a
                problem with how Slurm cleans up the job when it
                sends SIGKILL.

                I've found this Slurm article
                (https://slurm.schedmd.com/troubleshoot.html#completing),
                which has a section titled "Jobs and nodes are stuck
                in COMPLETING state", where it recommends increasing
                "UnkillableStepTimeout" in slurm.conf, but all that
                has done is prolong the time it takes for the job to
                time out.

                The default time for the "UnkillableStepTimeout" is 60
                seconds.
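
                For reference, the setting being described is a single
                line in slurm.conf (the value below is just an example;
                a restart or reconfigure is needed to pick it up):

                UnkillableStepTimeout=120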

                After the job completes, it stays in the CG
                (completing) status for those 60 seconds, and then
                the nodes the job was submitted to go into drain
                status.

                On the head node running slurmctld, I am seeing this
                in the log - /var/log/slurmctld:

                
                --------------------------------------------------------------------------------

                [2020-07-21T22:40:03.000] update_node: node node001
                reason set to: Kill task failed

                [2020-07-21T22:40:03.001] update_node: node node001
                state set to DRAINING

                On the compute node, I am seeing this in the log -
                /var/log/slurmd:

                
                --------------------------------------------------------------------------------

                [2020-07-21T22:38:33.110] [1485.batch] done with job

                [2020-07-21T22:38:33.110] [1485.extern] Sent signal 18
                to 1485.4294967295

                [2020-07-21T22:38:33.111] [1485.extern] Sent signal 15
                to 1485.4294967295

                [2020-07-21T22:39:02.820] [1485.extern] Sent SIGKILL
                signal to 1485.4294967295

                [2020-07-21T22:40:03.000] [1485.extern] error: ***
                EXTERN STEP FOR 1485 STEPD TERMINATED ON node001 AT
                2020-07-21T22:40:02 DUE TO JOB NOT ENDING WITH SIGNALS ***

                I've tried restarting the slurmd daemon on the
                compute nodes, and even completely rebooting a few
                compute nodes (node001, node002).

                From what I've seen, we're experiencing this on all
                nodes in the cluster.

                I've yet to restart the head node because there are
                still active jobs on the system, and I don't want to
                interrupt those.

                Thank you for your time,

                Ivan

