Re: [slurm-users] Hung tasks and high load when cancelling jobs

2018-05-05 Thread Chris Samuel
On Thursday, 3 May 2018 1:23:44 PM AEST Brendan Moloney wrote:

> I upgraded somewhat recently from 17.02 to 17.11, but I am not positive if
> this bug is new or just went unnoticed previously.

There is a known deadlock bug in 17.11.x which can happen for certain workloads, hopefully fixed in 17.

[slurm-users] Hung tasks and high load when cancelling jobs

2018-05-02 Thread Brendan Moloney
Hi,

Sometimes when jobs are cancelled I see a spike in system load and hung task errors. It appears to be related to NFS and cgroups. The slurmstepd process gets hung cleaning up cgroups:

INFO: task slurmstepd:11222 blocked for more than 120 seconds.
Not tainted 4.4.0-119-generic #143-Ubun