Re: [slurm-users] Evenly use all nodes

2020-07-02 Thread Angelos Ching
Hi Timo, We have faced a similar problem and our solution was to run an hourly cron job that sets a random node weight for each node. It works pretty well for us. Best regards, Angelos (Sent from mobile, please pardon any typos and brevity.) > On 2020/07/03 2:24, Timo Rothenpieler wrote: > > Hel
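For illustration only (not part of the original message), a minimal sketch of the cron-driven random-weight approach described above, assuming bash, sinfo and scontrol available on the controller host, and an arbitrary weight range of 1-100:

    #!/bin/bash
    # Hourly cron job (controller host): give every node a pseudo-random scheduling weight.
    # Among otherwise equal nodes Slurm prefers the lowest weight, so reshuffling the
    # weights spreads small workloads across the whole cluster instead of the list head.
    for node in $(sinfo -h -N -o '%N' | sort -u); do
        scontrol update NodeName="$node" Weight=$(( RANDOM % 100 + 1 ))
    done

Depending on the Slurm version, weights changed this way may revert to the slurm.conf values after a reconfigure or restart; re-running the job hourly also serves as a refresh.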

Re: [slurm-users] Evenly use all nodes

2020-07-02 Thread Timo Rothenpieler
On 02.07.2020 20:28, Luis Huang wrote: > You can look into the CR_LLN feature. It works fairly well in our environment and jobs are distributed evenly. > SelectTypeParameters=CR_Core_Memory,CR_LLN As I understand it, CR_LLN will schedule jobs to the least-used node. But if there's nearly n

Re: [slurm-users] Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Chris Samuel
On Thursday, 2 July 2020 6:52:15 AM PDT Prentice Bisbal wrote: > [2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill > event count: 1 We get that line for pretty much every job; I don't think it reflects the OOM killer being invoked on something in the extern step. OOM killer

Re: [slurm-users] [External] Re: Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Prentice Bisbal
Not 100%, which is why I'm asking here. I searched the log files and that line was only present after a handful of jobs, including the ones I'm investigating, so it's not something happening after/to every job. However, this is happening on nodes with plenty of RAM, so if the OOM Killer is being

Re: [slurm-users] Evenly use all nodes

2020-07-02 Thread Luis Huang
You can look into the CR_LLN feature. It works fairly well in our environment and jobs are distributed evenly. SelectTypeParameters=CR_Core_Memory,CR_LLN https://slurm.schedmd.com/slurm.conf.html
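For context (not part of the original message), the corresponding slurm.conf excerpt would look roughly like the following; SelectType=select/cons_res is an assumption here, since CR_Core_Memory,CR_LLN requires one of the consumable-resource plugins (cons_res or cons_tres):

    # slurm.conf (excerpt): allocate cores+memory, prefer the least-loaded node
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory,CR_LLN

The value currently in effect can be checked with 'scontrol show config | grep SelectTypeParameters'.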

[slurm-users] Evenly use all nodes

2020-07-02 Thread Timo Rothenpieler
Hello, Our cluster is very rarely fully utilized; often only a handful of jobs are running. This has the effect that the first couple of nodes get used a whole lot more frequently than the ones nearer the end of the list. This is primarily a problem because of the SSDs in the nodes. They

[slurm-users] Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Prentice Bisbal
I maintain a very heterogeneous cluster (different processors, different amounts of RAM, etc.). I have a user reporting the following problem. He's running the same job multiple times with different input parameters. The jobs run fine unless they land on specific nodes. He's specifying --mem=2G

Re: [slurm-users] Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Ryan Novosielski
Are you sure that the OOM killer is involved? I can get you specifics later, but if it's that one line about OOM events, you may see it after successful jobs too. I just had a SLURM bug where this came up.

[slurm-users] Automatically cancel jobs not utilizing their GPUs

2020-07-02 Thread Stephan Roth
Hi all, Does anyone have ideas or suggestions on how to automatically cancel jobs which don't utilize the GPUs allocated to them? The Slurm version in use is 19.05. I'm thinking about collecting GPU utilization per process on all nodes with NVML/nvidia-smi, updating a mean value of the collected
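Not from the original thread, but a rough per-node sketch of one way to detect such jobs with nvidia-smi, assuming bash, the cgroup plugins (so a job's processes can be recognized via /proc/<pid>/cgroup), and that "no compute process on any GPU" is an acceptable proxy for "not utilizing its GPUs"; the averaging of utilization samples Stephan describes would be layered on top:

    #!/bin/bash
    # Per-node sketch: flag running jobs on this node that currently have no process
    # on any GPU. A real version would sample utilization over time instead of acting
    # on a single snapshot, and would skip jobs that never requested GPUs.

    node=$(hostname -s)

    # PIDs that currently hold a compute context on any GPU of this node.
    gpu_pids=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)

    for jobid in $(squeue -h -w "$node" -t RUNNING -o '%A'); do
        active=0
        for pid in $gpu_pids; do
            # With the cgroup task/proctrack plugins, a job's processes carry
            # "job_<jobid>" in their cgroup path; this mapping is an assumption.
            if grep -q "job_${jobid}\b" "/proc/${pid}/cgroup" 2>/dev/null; then
                active=1
                break
            fi
        done
        if [ "$active" -eq 0 ]; then
            echo "job $jobid on $node has no GPU process"   # or: scancel "$jobid"
        fi
    done

Per-process utilization figures (rather than mere presence of a process) would need NVML accounting mode or nvidia-smi pmon, which is why this sketch only checks for a compute process at all.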