Do you know if the job is actually being killed? We had an issue on an older 
version of Slurm where we got OOM errors but the tasks actually completed. 
The OOM was reported when the job exited and was a false error.
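
One way to check this is to compare the per-step states that accounting recorded for the job. The sketch below is only an illustration: it assumes you have saved the output of something like `sacct -j <jobid> --parsable2 --format=JobID,State,ExitCode,MaxRSS,ReqMem`, and the helper names and the sample records are made up for the example, not taken from a real job.

```python
import csv
import io


def parse_sacct(text):
    """Parse pipe-delimited `sacct --parsable2` output (header row first)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))


def oom_summary(rows):
    """Return (IDs flagged OUT_OF_MEMORY, IDs in any other state)."""
    flagged = [r["JobID"] for r in rows if r["State"] == "OUT_OF_MEMORY"]
    other = [r["JobID"] for r in rows if r["State"] != "OUT_OF_MEMORY"]
    return flagged, other


# Hypothetical sample: the job and batch step are flagged, but step 0 completed.
SAMPLE = """JobID|State|ExitCode|MaxRSS|ReqMem
12345|OUT_OF_MEMORY|0:125||4Gn
12345.batch|OUT_OF_MEMORY|0:125|1200K|
12345.0|COMPLETED|0:0|3100K|
"""

flagged, other = oom_summary(parse_sacct(SAMPLE))
print("flagged as OOM:", flagged)
print("other states:  ", other)
```

If the steps that do the real work show COMPLETED with a 0:0 exit code and only the job or batch record is flagged, that would be consistent with the false error described above, though checking the job's actual output is still the safer test.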

Also, there are several bug reports open right now about an issue similar to 
what you have described. You can go to bugs.schedmd.com to look at those bug 
reports.

-Roger

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean 
Caron
Sent: Tuesday, August 10, 2021 4:01 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>; Sean Caron 
<sca...@umich.edu>
Subject: [slurm-users] Spurious OOM-kills with cgroups on 20.11.8?

Hi all,

Has anyone else observed jobs getting OOM-killed in 20.11.8 with cgroups that 
ran fine in previous versions like 20.10?

I've had a few reports from users after upgrading maybe six weeks ago that 
their jobs are getting OOM-killed when they haven't changed anything and the 
job ran to completion in the past with the same memory specification.

The most recent report I received today involved a job running a "cp" command 
getting OOM-killed. I have a hard time believing "cp" uses very much memory...

These machines are running various 5.4.x or 5.3.x Linux kernels.

I've had really good luck with the cgroups OOM-killer the last few years 
keeping my nodes from getting overwhelmed by runaway jobs. I'd hate to have to 
disable it just to clean up these weird issues.

My cgroup.conf file looks like the following:

CgroupAutomount=yes

ConstrainCores=yes

ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

AllowedRamSpace=100
AllowedSwapSpace=0

Should I maybe bump AllowedRamSpace? I don't see how this is any different than 
just asking the user to re-run the job with a larger memory allocation request. 
And that doesn't explain why jobs suddenly need more memory before getting 
OOM-killed than they used to.
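
To make the comparison concrete, here is a rough sketch of the arithmetic, assuming the RAM limit is simply the allocated memory scaled by the AllowedRamSpace percentage. The function name is made up for illustration, and the sketch ignores any other cgroup.conf settings that may adjust the final limit.

```python
def ram_limit_mb(requested_mb, allowed_ram_space_pct=100.0):
    """Approximate cgroup RAM limit in MB (assumed: request * percent / 100)."""
    return requested_mb * allowed_ram_space_pct / 100.0


# With AllowedRamSpace=100 the limit equals the request.
print(ram_limit_mb(4000, 100))

# Raising AllowedRamSpace to 125 for a 4000 MB request...
print(ram_limit_mb(4000, 125))

# ...gives the same limit as leaving it at 100 and requesting 5000 MB.
print(ram_limit_mb(5000, 100))
```

Under that assumption, bumping the percentage only moves every job's limit up by a fixed factor, which is the same effect as larger requests and says nothing about why usage appears to have grown.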

Thanks,

Sean

