Hi all,

Has anyone else observed jobs getting OOM-killed under 20.11.8 with cgroups when the same jobs ran fine under previous versions like 20.10?
Since we upgraded, maybe six weeks ago, I've had a few reports from users that their jobs are getting OOM-killed even though they haven't changed anything and the jobs ran to completion in the past with the same memory request. The most recent report, which I received today, involved a job running a "cp" command getting OOM-killed. I have a hard time believing "cp" uses very much memory... These machines are running various 5.4.x or 5.3.x Linux kernels.

I've had really good luck with the cgroups OOM-killer over the last few years at keeping my nodes from getting overwhelmed by runaway jobs, so I'd hate to have to disable it just to clean up these weird issues.

My cgroup.conf file looks like the following:

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRamSpace=100
AllowedSwapSpace=0

Should I maybe bump AllowedRamSpace? I don't see how that is any different from just asking the user to re-run the job with a larger memory request, and it doesn't explain why jobs suddenly need more memory before getting OOM-killed than they used to.

Thanks,
Sean
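
P.S. For concreteness, the change I'm contemplating would just be raising that percentage in cgroup.conf so the cgroup limit sits a bit above the requested memory, something along these lines (110 is only an example value I haven't tested, i.e. allow 10% headroom over the request before the OOM-killer steps in):

# hypothetical example: cap the job cgroup at 110% of the requested memory
AllowedRamSpace=110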