Hi Roger,

Thanks for the response. I am pretty sure the job is actually getting killed. I don't see it running in the process table, and the local SLURM log just displays:
[2021-08-10T16:31:36.139] [6628753.batch] error: Detected 1 oom-kill event(s) in StepId=6628753.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

Best,

Sean

On Tue, Aug 10, 2021 at 5:13 PM Roger Moye <rm...@quantlab.com> wrote:

> Do you know if the job is actually being killed? We had an issue on an
> older version of Slurm whereby we got OOM errors but the tasks actually
> completed. The OOM error came when the job exited and was a false alarm.
>
> Also, there are several bug reports open right now about an issue similar
> to what you have described. You can go to bugs.schedmd.com to look at
> those bug reports.
>
> -Roger
>
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Caron
> Sent: Tuesday, August 10, 2021 4:01 PM
> To: Slurm User Community List <slurm-users@lists.schedmd.com>; Sean Caron <sca...@umich.edu>
> Subject: [slurm-users] Spurious OOM-kills with cgroups on 20.11.8?
>
> Hi all,
>
> Has anyone else observed jobs getting OOM-killed in 20.11.8 with cgroups
> that ran fine in previous versions like 20.10?
>
> I've had a few reports from users since upgrading about six weeks ago
> that their jobs are getting OOM-killed when they haven't changed anything
> and the same jobs ran to completion in the past with the same memory
> specification.
>
> The most recent report I received today involved a job running a "cp"
> command getting OOM-killed. I have a hard time believing "cp" uses very
> much memory...
>
> These machines are running various 5.4.x or 5.3.x Linux kernels.
>
> I've had really good luck with the cgroup OOM-killer over the last few
> years keeping my nodes from getting overwhelmed by runaway jobs. I'd hate
> to have to disable it just to clean up these weird issues.
>
> My cgroup.conf file looks like the following:
>
> CgroupAutomount=yes
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> AllowedRamSpace=100
> AllowedSwapSpace=0
>
> Should I maybe bump AllowedRamSpace? I don't see how that is any different
> from just asking the user to re-run the job with a larger memory
> allocation request, and it doesn't explain why jobs suddenly need more
> memory than they used to before getting OOM-killed.
>
> Thanks,
>
> Sean
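For anyone digging into a similar report: a quick first check is to compare the step's peak RSS against the memory it requested in the accounting records. This is a minimal sketch, assuming job accounting is enabled and a reasonably recent sacct; the job ID is the one from the log excerpt above. Steps that were genuinely killed by the cgroup OOM handler are normally reported with State OUT_OF_MEMORY.

    # Compare what the step asked for (ReqMem) with what it peaked at (MaxRSS).
    sacct -j 6628753 --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS,Elapsed

Keep in mind that MaxRSS is polled, so a short memory spike can be missed; but if MaxRSS is nowhere near ReqMem and the step is still flagged OUT_OF_MEMORY, that points more toward the kind of false-positive OOM Roger describes than toward a job that actually needs more memory.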
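On the AllowedRamSpace question: with ConstrainRAMSpace=yes, the limit written into the step's memory cgroup is roughly the job's requested memory times AllowedRamSpace/100, and with ConstrainSwapSpace=yes plus AllowedSwapSpace=0 there is no swap headroom beyond that, so any overshoot triggers the OOM killer. The values a node is actually enforcing can be read out of the cgroup filesystem while the job is still running. This is a sketch assuming cgroup v1 and the default hierarchy; the exact paths depend on your cgroup mountpoint and Slurm configuration, and <uid> is a placeholder for the job owner's numeric UID.

    # Limit applied to the batch step (bytes), its peak usage, and the OOM counters.
    cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_6628753/step_batch/memory.limit_in_bytes
    cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_6628753/step_batch/memory.max_usage_in_bytes
    cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_6628753/step_batch/memory.oom_control

If the limit already matches the requested memory, raising AllowedRamSpace only papers over the problem, which is consistent with the point above that it is no different from asking users to request more memory.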