On Thursday, 3 May 2018 1:23:44 PM AEST Brendan Moloney wrote:
> I upgraded fairly recently from 17.02 to 17.11, but I am not sure whether
> this bug is new or just went unnoticed previously.
There is a known deadlock bug in 17.11.x which can happen for certain
workloads, hopefully fixed in 17.
Hi,
Sometimes when jobs are cancelled I see a spike in system load and hung-task
errors. It appears to be related to NFS and cgroups.
The slurmstepd process gets hung cleaning up cgroups:
INFO: task slurmstepd:11222 blocked for more than 120 seconds.
Not tainted 4.4.0-119-generic #143-Ubun