Re: [slurm-users] Spurious OOM-kills with cgroups on 20.11.8?

2021-08-10 Thread Sean Caron
Hi Roger, Thanks for the response. I am pretty sure the job is actually getting killed. I don't see it running in the process table, and the local Slurm log just displays: [2021-08-10T16:31:36.139] [6628753.batch] error: Detected 1 oom-kill event(s) in StepId=6628753.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler. …
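A quick way to sanity-check this directly on the compute node, while the job's cgroup still exists, is to read its memory cgroup counters. This is a minimal sketch assuming cgroup v1 and Slurm's usual cgroup layout; the exact path varies by distribution and cgroup configuration, and the job ID is taken from the log line above:

    # Peak memory use vs. the limit Slurm set for the job:
    cat /sys/fs/cgroup/memory/slurm/uid_*/job_6628753/memory.max_usage_in_bytes
    cat /sys/fs/cgroup/memory/slurm/uid_*/job_6628753/memory.limit_in_bytes
    # OOM-control status for the same cgroup:
    cat /sys/fs/cgroup/memory/slurm/uid_*/job_6628753/memory.oom_control

If max_usage stays well below the limit, the oom-kill event Slurm reports is suspect.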

Re: [slurm-users] Spurious OOM-kills with cgroups on 20.11.8?

2021-08-10 Thread Roger Moye
Do you know if the job is actually being killed? We had an issue on an older version of Slurm whereby we got OOM errors but the tasks had actually completed. The OOM came when the job exited and was a false error. Also, there are several bug reports open right now about an issue similar to what …
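One way to tell a false OOM from a real one is to compare Slurm's accounting record against the node log. A minimal check with sacct, using the job ID from Sean's log as an example:

    sacct -j 6628753 --format=JobID,State,ExitCode,Elapsed,MaxRSS
    # State=COMPLETED despite an oom-kill message in the node log would
    # match the false-error behavior described above; State=OUT_OF_MEMORY
    # means Slurm really recorded the kill against the job.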

[slurm-users] Spurious OOM-kills with cgroups on 20.11.8?

2021-08-10 Thread Sean Caron
Hi all, Has anyone else observed jobs getting OOM-killed in 20.11.8 with cgroups that ran fine in previous versions like 20.10? I've had a few reports from users after upgrading maybe six weeks ago that their jobs are getting OOM-killed when they haven't changed anything and the job previously ran to completion …

Re: [slurm-users] Compact scheduling strategy for small GPU jobs

2021-08-10 Thread Brian Andrus
You may also want to look at node weights. By setting them at different levels for each node, you can give a preference to one over the other. That may be a way to do a "try this node first" method of job placement. Brian Andrus On 8/10/2021 9:19 AM, Jack Chen wrote: Thanks for your reply! It' …
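A sketch of what that could look like in slurm.conf (hypothetical node names and GRES counts; a lower weight means the node is preferred when allocating):

    NodeName=gpu01 Gres=gpu:8 Weight=1    # filled first
    NodeName=gpu02 Gres=gpu:8 Weight=10   # used once gpu01 is busy

Apply with 'scontrol reconfigure'; on some versions a slurmctld restart may be needed for node definition changes to take effect.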

Re: [slurm-users] Compact scheduling strategy for small GPU jobs

2021-08-10 Thread Jack Chen
Thanks for your reply! It's certain that Slurm will not place small jobs on the same node if resources are not available, but I'm using default values in my case. The job command is: srun -n 1 --cpus-per-task=2 --gres=gpu:1 'sleep 12000'. When I submit another 8 one-GPU jobs, they can run both on node A and …
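For anyone trying to reproduce this, one way to see where the small jobs actually landed is squeue's format options (%N is the allocated node list, %b the requested GRES; available fields vary somewhat across Slurm versions):

    squeue -u $USER -t RUNNING -o "%.10i %.10N %.15b"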

Re: [slurm-users] Compact scheduling strategy for small GPU jobs

2021-08-10 Thread Brian Andrus
You may want to look at your resources. If the memory allocation adds up such that there isn't enough left for any job to run, it won't matter that there are still GPUs available. The same goes for any other resource (CPUs, cores, etc.). Brian Andrus On 8/10/2021 8:07 AM, Jack Chen wrote: Does anyone …
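A quick way to check this per node is to compare configured against allocated trackable resources (hypothetical node name; the CfgTRES/AllocTRES fields appear in scontrol output on TRES-aware Slurm versions):

    scontrol show node gpunode01 | egrep "CfgTRES|AllocTRES"
    # If allocated memory is close to configured memory, no further job
    # can start there even while gres/gpu remains free.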

Re: [slurm-users] Compact scheduling strategy for small GPU jobs

2021-08-10 Thread Renfro, Michael
Did Diego's suggestion from [1] not help narrow things down? [1] https://lists.schedmd.com/pipermail/slurm-users/2021-August/007708.html From: slurm-users on behalf of Jack Chen Date: Tuesday, August 10, 2021 at 10:08 AM To: Slurm User Community List Subject: Re: [slurm-users] Compact schedul…

Re: [slurm-users] how to temporarily avoid node being suspended by SuspendProgram

2021-08-10 Thread Brian Andrus
Certainly, set SuspendExcNodes: a list of nodes never to place in power-saving mode, in Slurm's hostlist expression format (by default, no nodes are excluded). Then do 'scontrol reconfigure'. Repeat when you want them to be included again. Brian Andrus On 8/10/2021 5:46 AM, Josef Dvoracek wrote: …
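For example (hypothetical node range; the setting goes in slurm.conf):

    SuspendExcNodes=node[01-04]

followed by:

    scontrol reconfigure

Removing the entry and reconfiguring again puts the nodes back under normal power-saving control.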

Re: [slurm-users] Compact scheduling strategy for small GPU jobs

2021-08-10 Thread Jack Chen
Does anyone have any ideas on this? On Fri, Aug 6, 2021 at 2:52 PM Jack Chen wrote: > I'm using Slurm 15.08.11; when I submit several 1-GPU jobs, Slurm doesn't > allocate nodes using a compact strategy. Does anyone know how to solve this? Will > upgrading to the latest Slurm version help? > > For example, the …
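It may also be worth checking how node selection is configured, since some parameters deliberately spread jobs out (CR_LLN, for instance, favors the least-loaded node, which works against compact placement). A quick look:

    scontrol show config | egrep -i "SelectType|SchedulerParam"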

[slurm-users] how to temporarily avoid node being suspended by SuspendProgram

2021-08-10 Thread Josef Dvoracek
hi @list, Sometimes I work on or test something on a compute node, having it drained in Slurm. After SuspendTime passes, such a node is suspended by SuspendProgram, sometimes at exactly the same time as when I'm e.g. compiling something on it. Is there any way I can temporarily disable power saving …