Agreed. You may also want to write a script that gathers the list of processes in 
"D state" (uninterruptible kernel wait) and prints their kernel stacks, and 
configure it as UnkillableStepProgram so that you can capture the processes and 
the relevant system calls that caused the job to become unkillable / time out 
while exiting, for further troubleshooting.
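
Something like the rough Python sketch below is what I have in mind (untested; 
the output format and error handling are illustrative only). It walks /proc, 
reports every task in D state, and dumps its kernel stack from /proc/<pid>/stack 
so you can see which syscall it is blocked in:

#!/usr/bin/env python3
# Rough sketch for an UnkillableStepProgram script: walk /proc, report every
# task stuck in D (uninterruptible sleep) state, and dump its kernel stack
# from /proc/<pid>/stack. Output format and error handling are illustrative.
import os

def state_and_comm(pid):
    # /proc/<pid>/stat looks like: "<pid> (<comm>) <state> ..."
    try:
        with open(f"/proc/{pid}/stat") as f:
            head, rest = f.read().rsplit(")", 1)
    except OSError:
        return None            # task exited while we were scanning
    return rest.split()[0], head.split("(", 1)[1]

for pid in filter(str.isdigit, os.listdir("/proc")):
    info = state_and_comm(pid)
    if not info or info[0] != "D":
        continue
    print(f"PID {pid} ({info[1]}) is in D state; kernel stack:")
    try:
        # Readable by root only; shows the in-kernel call chain.
        with open(f"/proc/{pid}/stack") as f:
            print(f.read())
    except OSError as e:
        print(f"  (could not read /proc/{pid}/stack: {e})")

You would then point UnkillableStepProgram in slurm.conf at that script.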

Regards,
Angelos
(Sent from mobile, please pardon me for typos and cursoriness.)

> On 2020/07/23, at 0:41, Ryan Cox <ryan_...@byu.edu> wrote:
> 
>  Ivan,
> 
> Are you having I/O slowness? That is the most common cause for us. If it's 
> not that, you'll want to look through all the reasons that it takes a long 
> time for a process to actually die after a SIGKILL because one of those is 
> the likely cause. Typically it's because the process is waiting for an I/O 
> syscall to return. Sometimes swap death is the culprit, but usually not at 
> the scale that you stated. Maybe you could try reproducing the issue 
> manually, or putting something in the epilog to see the state of the 
> processes in the job's cgroup.
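> 
> As a starting point, an epilog snippet along these lines would dump the state 
> of whatever is left in the job's cgroup. The cgroup v1 freezer layout 
> /sys/fs/cgroup/freezer/slurm/uid_<uid>/job_<jobid> is an assumption here; 
> adjust the path to what your nodes actually use:
> 
> #!/usr/bin/env python3
> # Sketch only: list every PID still in the job's cgroup and print its state
> # from /proc/<pid>/stat ("D" means it is stuck in the kernel).
> # The cgroup v1 freezer path below is an assumption; verify it on your nodes.
> import glob
> import os
> import sys
> 
> job_id = os.environ.get("SLURM_JOB_ID")   # set in the epilog environment
> if not job_id:
>     sys.exit("SLURM_JOB_ID not set")
> 
> pattern = f"/sys/fs/cgroup/freezer/slurm/uid_*/job_{job_id}/**/cgroup.procs"
> for procs_file in glob.glob(pattern, recursive=True):
>     with open(procs_file) as f:
>         pids = f.read().split()
>     for pid in pids:
>         try:
>             with open(f"/proc/{pid}/stat") as f:
>                 head, rest = f.read().rsplit(")", 1)
>         except OSError:
>             continue                      # process exited between reads
>         comm, state = head.split("(", 1)[1], rest.split()[0]
>         print(f"job {job_id}: pid {pid} ({comm}) state {state}")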
> 
> Ryan
> 
> On 7/22/20 10:24 AM, Ivan Kovanda wrote:
>> Dear slurm community,
>>  
>> Currently running slurm version 18.08.4
>>  
>> We have been experiencing an issue that causes the nodes a Slurm job was 
>> submitted to to go into "drain" state.
>> From what I've seen, it appears there is a problem with how Slurm cleans up 
>> the job via SIGKILL.
>>  
>> I've found this Slurm article 
>> (https://slurm.schedmd.com/troubleshoot.html#completing), which has a 
>> section titled "Jobs and nodes are stuck in COMPLETING state" that 
>> recommends increasing "UnkillableStepTimeout" in slurm.conf, but all that 
>> has done is prolong the time it takes for the job to time out.
>> The default value of "UnkillableStepTimeout" is 60 seconds.
>>  
>> After the job completes, it stays in the CG (completing) status for the 60 
>> seconds, then the nodes the job was submitted to go to drain status.
>>  
>> On the headnode running slurmctld, I am seeing this in the log - 
>> /var/log/slurmctld:
>> --------------------------------------------------------------------------------------------------------------------------------------------
>> [2020-07-21T22:40:03.000] update_node: node node001 reason set to: Kill task 
>> failed
>> [2020-07-21T22:40:03.001] update_node: node node001 state set to DRAINING
>>  
>> On the compute node, I am seeing this in the log - /var/log/slurmd:
>> --------------------------------------------------------------------------------------------------------------------------------------------
>> [2020-07-21T22:38:33.110] [1485.batch] done with job
>> [2020-07-21T22:38:33.110] [1485.extern] Sent signal 18 to 1485.4294967295
>> [2020-07-21T22:38:33.111] [1485.extern] Sent signal 15 to 1485.4294967295
>> [2020-07-21T22:39:02.820] [1485.extern] Sent SIGKILL signal to 
>> 1485.4294967295
>> [2020-07-21T22:40:03.000] [1485.extern] error: *** EXTERN STEP FOR 1485 
>> STEPD TERMINATED ON node001 AT 2020-07-21T22:40:02 DUE TO JOB NOT ENDING 
>> WITH SIGNALS ***
>>  
>>  
>> I've tried restarting the slurmd daemon on the compute nodes, and even 
>> completely rebooting a few compute nodes (node001, node002).
>> From what I've seen, we're experiencing this on all nodes in the cluster.
>> I've yet to restart the headnode because there are still active jobs on the 
>> system and I don't want to interrupt those.
>>  
>>  
>> Thank you for your time,
>> Ivan
>>  
> 
