Re: [slurm-users] Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Chris Samuel
On Thursday, 2 July 2020 6:52:15 AM PDT Prentice Bisbal wrote:
> [2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill
> event count: 1
We get that line for pretty much every job; I don't think it reflects the OOM killer being invoked on something in the extern step. OOM killer

[slurm-users] Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Prentice Bisbal
I maintain a very heterogeneous cluster (different processors, different amounts of RAM, etc.). I have a user reporting the following problem. He's running the same job multiple times with different input parameters. The jobs run fine unless they land on specific nodes. He's specifying --mem=2G

Re: [slurm-users] Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Ryan Novosielski
Are you sure that the OOM killer is involved? I can get you specifics later, but if it’s that one line about OOM events, you may see it after successful jobs too. I just had a SLURM bug where this came up.
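
Since the replies suggest this log line can show up for successful jobs as well, one way to check is to tally the `_oom_event_monitor` lines per job step and compare them against the jobs' final states. Below is a minimal Python sketch of the tallying part. The regular expression is modelled only on the single log line quoted in this thread, and the function name `oom_events_by_step` is hypothetical; other Slurm versions may format the line differently.

```python
import re
from collections import defaultdict

# Pattern modelled on the one slurmd log line quoted in this thread.
# Assumption: other Slurm versions may word or lay out this line differently.
OOM_LINE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+\[(?P<job>\d+)\.(?P<step>[^\]]+)\]\s+"
    r"_oom_event_monitor: oom-kill event count: (?P<count>\d+)"
)


def oom_events_by_step(lines):
    """Return {(job_id, step): highest oom-kill event count seen}."""
    events = defaultdict(int)
    for line in lines:
        match = OOM_LINE.search(line)
        if match:
            key = (match.group("job"), match.group("step"))
            events[key] = max(events[key], int(match.group("count")))
    return dict(events)


# The log line from the original post, rejoined onto a single line.
sample = [
    "[2020-07-01T16:19:19.463] [801777.extern] "
    "_oom_event_monitor: oom-kill event count: 1",
]

if __name__ == "__main__":
    for (job, step), count in oom_events_by_step(sample).items():
        print(f"job {job} step {step}: oom-kill event count {count}")
```

If the same job IDs turn up here and also completed successfully, that would support the view that the line alone does not prove the job was killed for exceeding its memory request.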