Agreed. You may also want to write a script that gathers the list of processes in "D state" (uninterruptible kernel wait) and prints their kernel stacks, and configure it as UnkillableStepProgram. That way you can capture, for further troubleshooting, the processes and the system calls that caused the job to become unkillable or time out while exiting. A rough sketch is below.
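Something like the following could serve as a starting point. This is a minimal sketch in Python only: the log path and report format are arbitrary choices of mine, and reading /proc/<pid>/stack requires root and a kernel with stack tracing enabled. Pointing slurm.conf at the script with UnkillableStepProgram=/path/to/script makes slurmd run it when a step is still alive after UnkillableStepTimeout.

#!/usr/bin/env python3
# Rough sketch only: log every process stuck in "D" (uninterruptible sleep)
# together with its kernel stack, so the blocking syscall can be inspected
# after the fact. Must run as root, since /proc/<pid>/stack is root-readable.
# The log path below is an arbitrary choice, not a Slurm default.

import glob
import time

LOGFILE = "/var/log/slurm/unkillable_report.log"  # assumed location

def read(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""

with open(LOGFILE, "a") as log:
    log.write("=== D-state report %s ===\n" % time.ctime())
    for status_path in glob.glob("/proc/[0-9]*/status"):
        status = read(status_path)
        if "\nState:\tD" not in status:
            continue                      # keep only processes in kernel wait
        pid = status_path.split("/")[2]
        name = status.splitlines()[0].split("\t")[-1]
        log.write("PID %s (%s) in D state, kernel stack:\n" % (pid, name))
        log.write(read("/proc/%s/stack" % pid) + "\n")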
Regards, Angelos
(Sent from mobile, please pardon me for typos and cursoriness.)

> On Jul 23, 2020, at 0:41, Ryan Cox <ryan_...@byu.edu> wrote:
>
> Ivan,
>
> Are you having I/O slowness? That is the most common cause for us. If it's
> not that, you'll want to look through all the reasons it can take a long
> time for a process to actually die after a SIGKILL, because one of those is
> the likely cause. Typically it's because the process is waiting for an I/O
> syscall to return. Sometimes swap death is the culprit, but usually not at
> the scale that you stated. Maybe you could try reproducing the issue
> manually, or putting something in the epilog to see the state of the
> processes in the job's cgroup.
>
> Ryan
>
> On 7/22/20 10:24 AM, Ivan Kovanda wrote:
>> Dear slurm community,
>>
>> Currently running slurm version 18.08.4
>>
>> We have been experiencing an issue causing any nodes a slurm job was
>> submitted to to "drain".
>> From what I've seen, it appears that there is a problem with how slurm is
>> cleaning up the job with the SIGKILL process.
>>
>> I've found this slurm article
>> (https://slurm.schedmd.com/troubleshoot.html#completing), which has a
>> section titled "Jobs and nodes are stuck in COMPLETING state", where it
>> recommends increasing the "UnkillableStepTimeout" in slurm.conf, but
>> all that has done is prolong the time it takes for the job to time out.
>> The default "UnkillableStepTimeout" is 60 seconds.
>>
>> After the job completes, it stays in the CG (completing) status for the 60
>> seconds, then the nodes the job was submitted to go to drain status.
>>
>> On the headnode running slurmctld, I am seeing this in the log -
>> /var/log/slurmctld:
>> --------------------------------------------------------------------------------------------------------------------------------------------
>> [2020-07-21T22:40:03.000] update_node: node node001 reason set to: Kill task failed
>> [2020-07-21T22:40:03.001] update_node: node node001 state set to DRAINING
>>
>> On the compute node, I am seeing this in the log - /var/log/slurmd:
>> --------------------------------------------------------------------------------------------------------------------------------------------
>> [2020-07-21T22:38:33.110] [1485.batch] done with job
>> [2020-07-21T22:38:33.110] [1485.extern] Sent signal 18 to 1485.4294967295
>> [2020-07-21T22:38:33.111] [1485.extern] Sent signal 15 to 1485.4294967295
>> [2020-07-21T22:39:02.820] [1485.extern] Sent SIGKILL signal to 1485.4294967295
>> [2020-07-21T22:40:03.000] [1485.extern] error: *** EXTERN STEP FOR 1485 STEPD TERMINATED ON node001 AT 2020-07-21T22:40:02 DUE TO JOB NOT ENDING WITH SIGNALS ***
>>
>> I've tried restarting the slurmd daemon on the compute nodes, and even
>> completely rebooted a few compute nodes (node001, node002).
>> From what I've seen, we're experiencing this on all nodes in the cluster.
>> I've yet to restart the headnode because there are still active jobs on the
>> system and I don't want to interrupt those.
>>
>> Thank you for your time,
>> Ivan
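P.S. Regarding Ryan's idea of putting something in the epilog to see the state of the processes in the job's cgroup, a rough sketch could look like the script below. SLURM_JOB_ID and SLURM_JOB_UID are available in the epilog environment; the freezer cgroup path assumes cgroup v1 with Slurm's uid_<uid>/job_<jobid> layout, and the log path is an arbitrary choice, so adjust both for your configuration.

#!/usr/bin/env python3
# Rough epilog helper sketch: record the name and state of every process
# still present in the job's cgroup when the epilog runs.
# Assumptions: cgroup v1 freezer controller, Slurm's uid_<uid>/job_<jobid>
# hierarchy, and an arbitrary log file location.

import os
import time

job_id = os.environ.get("SLURM_JOB_ID", "")
job_uid = os.environ.get("SLURM_JOB_UID", "")
procs_file = "/sys/fs/cgroup/freezer/slurm/uid_%s/job_%s/cgroup.procs" % (job_uid, job_id)
LOGFILE = "/var/log/slurm/epilog_cgroup_states.log"   # assumed location

def read(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""

with open(LOGFILE, "a") as log:
    log.write("=== job %s epilog %s ===\n" % (job_id, time.ctime()))
    for pid in read(procs_file).split():
        status = read("/proc/%s/status" % pid).splitlines()
        name = next((l.split("\t")[-1] for l in status if l.startswith("Name:")), "?")
        state = next((l.split("\t", 1)[-1] for l in status if l.startswith("State:")), "?")
        log.write("pid %s (%s): %s\n" % (pid, name, state))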