Hi Timo,
We have faced a similar problem, and our solution was to run an hourly cron job
that sets a random node weight for each node. It works pretty well for us.
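In case it's useful, here is a rough sketch of what such a cron job can look
like (the weight range and the way the node list is gathered are just examples,
adjust for your own cluster):

#!/bin/bash
# Give every node a random scheduling weight between 1 and 100 once an hour.
# Slurm prefers nodes with lower weight, so reshuffling the weights spreads
# jobs across the cluster over time instead of always filling the same nodes.
for node in $(sinfo -h -N -o "%N" | sort -u); do
    scontrol update NodeName="$node" Weight=$(( (RANDOM % 100) + 1 ))
done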
Best regards,
Angelos
(Sent from mobile, please pardon me for typos and cursoriness.)
> On 2020/07/03 2:24, Timo Rothenpieler wrote:
>
> Hello,
On 02.07.2020 20:28, Luis Huang wrote:
You can look into the CR_LLN feature. It works fairly well in our
environment and jobs are distributed evenly.
SelectTypeParameters=CR_Core_Memory,CR_LLN
From how I understand it, CR_LLN will schedule jobs to the least used
node. But if there's nearly no load on the cluster, most nodes are tied
for least used, so I'm not sure it would keep the first nodes in the
list from being picked every time.
On Thursday, 2 July 2020 6:52:15 AM PDT Prentice Bisbal wrote:
> [2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill
> event count: 1
We get that line for pretty much every job; I don't think it reflects the OOM
killer actually being invoked on something in the extern step.
Not 100%, which is why I'm asking here. I searched the log files and that
line was only present after a handful of jobs, including the ones I'm
investigating, so it's not something happening after/to every job.
However, this is happening on nodes with plenty of RAM, so if the OOM
Killer is being invoked, it doesn't seem like the nodes themselves should
be running out of memory.
You can look into the CR_LLN feature. It works fairly well in our environment
and jobs are distributed evenly.
SelectTypeParameters=CR_Core_Memory,CR_LLN
https://slurm.schedmd.com/slurm.conf.html
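For anyone trying this, my understanding is that CR_LLN is an option of the
cons_res/cons_tres selection plugins, so the relevant slurm.conf lines would
look roughly like the following (and changing SelectType generally needs a
restart of the Slurm daemons, not just a reconfigure):

SelectType=select/cons_res        # or select/cons_tres on newer versions
SelectTypeParameters=CR_Core_Memory,CR_LLN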
From: slurm-users on behalf of Timo Rothenpieler
Sent: Thursday,
Hello,
Our cluster is very rarely fully utilized; often only a handful of jobs
are running.
This has the effect that the first couple of nodes get used a whole lot
more frequently than the ones nearer the end of the list.
This is primarily a problem because of the SSDs in the nodes: they wear
out considerably faster on the nodes that get used all the time.
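As far as I can tell, the allocation order comes from the per-node Weight in
slurm.conf: lower weight is allocated first, and with identical weights Slurm
effectively just works through the node list in order. A sketch with made-up
node names of what I mean:

# slurm.conf sketch (hypothetical node names): lower Weight is preferred;
# when every node has the same weight, the list order decides.
NodeName=node[001-010] Weight=10
NodeName=node[011-020] Weight=20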
I maintain a very heterogeneous cluster (different processors, different
amounts of RAM, etc.). I have a user reporting the following problem:
he's running the same job multiple times with different input
parameters, and the jobs run fine unless they land on specific nodes. He's
specifying --mem=2G
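For reference, a quick way to compare what Slurm thinks the failing nodes
have versus the others (the node name below is a placeholder):

# configured memory, free memory and state, per node
sinfo -N -o "%N %m %e %T"
# full record for one of the failing nodes, including RealMemory
scontrol show node node042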
Are you sure that the OOM killer is involved? I can get you specifics later,
but if it’s that one line about OOM events, you may see it after successful
jobs too. I just had a SLURM bug where this came up.
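If it helps, one quick check (the jobid is a placeholder) is to compare the
accounting record's peak memory with the request; a job that really got killed
for exceeding its limit usually shows up as OUT_OF_MEMORY or with MaxRSS
sitting right at ReqMem:

sacct -j <jobid> --format=JobID,State,ExitCode,ReqMem,MaxRSS,MaxRSSNode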
Hi all,
Does anyone have ideas or suggestions on how to automatically cancel
jobs which don't utilize the GPUs allocated to them?
The Slurm version in use is 19.05.
I'm thinking about collecting GPU utilization per process on all nodes
with NVML/nvidia-smi, updating a running mean of the collected values per
job, and cancelling jobs whose mean utilization stays at zero for too long.
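A rough sketch of the collection side is below. It samples per-GPU rather than
per-process utilization, which is simpler with plain nvidia-smi (per-process
figures would need NVML's process APIs or nvidia-smi pmon); the log path,
sampling interval and thresholds are all just assumptions, and something
external would still have to average the samples per job and call scancel:

#!/bin/bash
# Sample the utilization of every GPU on this node once a minute and append
# it to a log; a separate step can average the samples for each allocated
# GPU and cancel jobs that stay at 0% for too long.
LOG=/var/tmp/gpu_util.log        # assumed location
while true; do
    # one line per GPU: index and utilization in percent, e.g. "0, 35"
    nvidia-smi --query-gpu=index,utilization.gpu \
               --format=csv,noheader,nounits |
    while IFS=', ' read -r idx util; do
        echo "$(date +%s) gpu${idx} ${util}" >> "$LOG"
    done
    sleep 60
done

Mapping a GPU index back to the owning job should be possible from
"scontrol show job -d" output, which lists the allocated GPU indices per
node, though I haven't checked exactly what that looks like on 19.05.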