Hi Roger,
Thanks for the response. I am pretty sure the job is actually getting
killed. I don't see it running in the process table and the local SLURM log
just displays:
[2021-08-10T16:31:36.139] [6628753.batch] error: Detected 1 oom-kill
event(s) in StepId=6628753.batch cgroup. Some of your processes may have
been killed by the cgroup out-of-memory handler.
Do you know if the job is actually being killed? We had an issue on an older
version of slurm whereby we got OOM errors but the tasks actually completed.
The OOM error only appeared when the job exited and was a false positive.
Also, there are several bug reports open right now about an issue similar to
what you're describing.
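If accounting is enabled, a quick way to check what Slurm recorded for the
step is something like the following (just a sketch; the job ID is the one
from the log excerpt above):

    sacct -j 6628753 --format=JobID,State,ExitCode,MaxRSS,ReqMem

A State of OUT_OF_MEMORY (or FAILED) for the batch step suggests the kill
was real; COMPLETED alongside the oom-kill message would point to the kind
of false positive described above.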
Hi all,
Has anyone else observed jobs getting OOM-killed in 20.11.8 with cgroups
that ran fine in previous versions like 20.10?
I've had a few reports from users, after upgrading maybe six weeks ago, that
their jobs are getting OOM-killed when they haven't changed anything and the
job previously ran to completion.
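When triaging reports like this, one comparison that can help (a sketch, not
something from the thread; <jobid> and <uid> are placeholders, and it assumes
cgroup v1 with the default mountpoint and ConstrainRAMSpace enabled, while the
job is still running) is the memory the job requested versus the limit the
cgroup actually enforces:

    # what the job asked for and how much it used
    sacct -j <jobid> --format=JobID,ReqMem,MaxRSS,State
    # limit Slurm's memory cgroup set for the job (cgroup v1 layout)
    cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.limit_in_bytes

If the enforced limit is lower than expected after the upgrade, the problem
is in the limit being applied rather than in the jobs' memory use.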
You may also want to look at node weights. By setting them at different
levels for each node, you can give a preference to one over the other.
That may be a way to do a "try this node first" method of job placement.
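For example (purely illustrative; the node names, resources, and weights
below are made up):

    # slurm.conf -- lower Weight is preferred, so jobs land on node001 first
    NodeName=node001 CPUs=32 RealMemory=128000 Weight=1
    NodeName=node002 CPUs=32 RealMemory=128000 Weight=10

How the change gets picked up (scontrol reconfigure vs. a daemon restart)
depends on the Slurm version.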
Brian Andrus
On 8/10/2021 9:19 AM, Jack Chen wrote:
Thanks for your reply! It's certain that Slurm will not place small jobs on
the same node if resources are not available. But I'm using default values in
my case; the job command is: srun -n 1 --cpus-per-task=2 --gres=gpu:1 'sleep
12000'.
When I submit another 8 one gpu jobs, they can run both on node A an
You may want to look at your resources. If the memory allocation adds up
such that there isn't enough left for any job to run, it won't matter
that there are still GPUs available.
The same applies to any other resource (CPUs, cores, etc.).
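A quick way to check whether memory is what's blocking placement (a sketch;
nodeA is a placeholder for one of the nodes in question):

    # configured vs. allocated memory on the node
    scontrol show node nodeA | grep -E 'RealMemory|AllocMem|CfgTRES|AllocTRES'
    # or, per node: name, configured memory (MB), free memory (MB)
    sinfo -N -o '%N %m %e'

If nearly all of the node's memory is already allocated, the remaining GPUs
there can't be used by new jobs.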
Brian Andrus
On 8/10/2021 8:07 AM, Jack Chen wrote:
Does anyone have any ideas on this?
Did Diego's suggestion from [1] not help narrow things down?
[1] https://lists.schedmd.com/pipermail/slurm-users/2021-August/007708.html
From: slurm-users on behalf of Jack Chen
Date: Tuesday, August 10, 2021 at 10:08 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Compact scheduling
Certainly, set:
* SuspendExcNodes: list of nodes to never place in power saving mode.
Use Slurm's hostlist expression format. By default, no nodes are excluded.
Then do 'scontrol reconfigure'.
Repeat (removing them from the list) when you want them to be included again.
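For instance (a sketch; the node names are placeholders):

    # slurm.conf
    SuspendExcNodes=node[01-02]

    scontrol reconfigure

and remove the entry again (plus another 'scontrol reconfigure') once you are
done working on those nodes.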
Brian Andrus
On 8/10/2021 5:46 AM, Josef Dvoracek wrote:
Does anyone have any ideas on this?
On Fri, Aug 6, 2021 at 2:52 PM Jack Chen wrote:
> I'm using Slurm 15.08.11; when I submit several 1-GPU jobs, Slurm doesn't
> allocate nodes using a compact strategy. Does anyone know how to solve this?
> Will upgrading to the latest Slurm version help?
>
> For example, the
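If it helps to see why the later jobs spill over to another node, listing what
the jobs already running on the first node requested is a quick check (a
sketch; nodeA is a placeholder, and the %b field assumes a squeue version that
prints requested GRES):

    squeue -w nodeA -t RUNNING -o '%.10i %.8u %.8b %.4C %.8m'

Summing the CPUs, memory, and GPUs of those jobs and comparing against the
node's totals shows which resource runs out first.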
Hi @list,
Sometimes I work on or test something on a compute node, having it drained
in Slurm.
After SuspendTime passes, such a node is suspended by SuspendProgram,
sometimes at exactly the same time as I'm, e.g., compiling something on it.
Is there any way I can temporarily disable power saving for such a node?