Not 100%, which is why I'm asking here. I searched the log files, and that
line was only present after a handful of jobs, including the ones I'm
investigating, so it's not something happening after/to every job.
However, this is happening on nodes with plenty of RAM, so if the OOM
killer is being invoked, something odd is definitely going on.
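
That _oom_event_monitor line comes from a per-cgroup OOM event notification rather than straight from the kernel log, so one way to double-check is to look for actual kernel OOM-killer messages on the node around the time the job ended. A rough sketch, assuming the node runs systemd (the time window is just taken from the slurmd log further down and would need adjusting):

# Kernel messages around the job's end time; look for oom-kill / "Killed process"
journalctl -k --since "2020-07-01 16:18" --until "2020-07-01 16:21" | grep -iE 'oom|killed process'

# Or, without the journal:
dmesg -T | grep -iE 'oom|killed process'

If nothing turns up there, the counter may just be the benign end-of-job report Ryan describes below.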
On 7/2/20 10:20 AM, Ryan Novosielski wrote:
Are you sure that the OOM killer is involved? I can get you specifics
later, but if it’s that one line about OOM events, you may see it
after successful jobs too. I just had a SLURM bug where this came up.
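
A quick cross-check, sketched here with the job ID from the log below (the exact field list is only an example), is to see what accounting recorded for the job; on recent Slurm versions a job that was really killed for exceeding its memory request usually shows State=OUT_OF_MEMORY, while the benign counter report leaves it COMPLETED:

# MaxRSS shows what the steps actually used versus the requested ReqMem.
sacct -j 801777 --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS,NodeList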
--
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
`'
On Jul 2, 2020, at 09:53, Prentice Bisbal <pbis...@pppl.gov> wrote:
I maintain a very heterogeneous cluster (different processors,
different amounts of RAM, etc.). I have a user reporting the following
problem.
He's running the same job multiple times with different input
parameters. The jobs run fine unless they land on specific nodes.
He's specifying --mem=2G in his sbatch files. On the nodes where the
jobs fail, I see that the OOM killer is invoked, so I asked him to
request more RAM. He set --mem=4G, and the jobs still fail on these two
nodes. However, they run just fine on other nodes with --mem=2G.
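
His batch script boils down to something like this (job name, time limit, and program are placeholders for illustration; the --mem line is the only directive taken from what he actually sets):

#!/bin/bash
#SBATCH --mem=4G                # per-node memory request (originally 2G)
#SBATCH --job-name=example      # placeholder
#SBATCH --time=01:00:00         # placeholder
srun ./his_program input.dat    # placeholder for the user's actual command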
When I look at the slurm log file on the nodes, I see something like
this for a failing job (in this case, --mem=4G was set):
[2020-07-01T16:19:06.222] _run_prolog: prolog with lock for job 801777 ran for 0 seconds
[2020-07-01T16:19:06.479] [801777.extern] task/cgroup: /slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.483] [801777.extern] task/cgroup: /slurm/uid_40324/job_801777/step_extern: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.506] Launching batch job 801777 for UID 40324
[2020-07-01T16:19:06.621] [801777.batch] task/cgroup: /slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.623] [801777.batch] task/cgroup: /slurm/uid_40324/job_801777/step_batch: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:19.385] [801777.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2020-07-01T16:19:19.389] [801777.batch] done with job
[2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill event count: 1
[2020-07-01T16:19:19.508] [801777.extern] done with job
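
For what it's worth, the cgroup that last line refers to can also be inspected directly on the node, but only while the job is still running, since Slurm removes it at job end. A sketch, assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory:

# Job cgroup path taken from the log above.
CG=/sys/fs/cgroup/memory/slurm/uid_40324/job_801777
cat $CG/memory.max_usage_in_bytes   # peak usage, to compare against the 4096MB limit
cat $CG/memory.failcnt              # how many times the limit was hit
cat $CG/memory.oom_control          # oom_kill_disable / under_oom flags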
Any ideas why the jobs are failing on just these two nodes, while
they run just fine on many other nodes?
For now, the user is excluding these two nodes using the -x option to
sbatch, but I'd really like to understand what's going on here.
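
The exclusion itself is just something along these lines, with placeholder hostnames and script name since I haven't named them here:

# -x is short for --exclude; keeps the job off the two problem nodes.
sbatch -x badnode01,badnode02 job.sh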
--
Prentice
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov