Re: [slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5

2022-02-14 Thread Paul Brunk
Hi: Thanks for your feedback, guys :). We continue to find srun behaving properly re: core placement. BTW, we've further established that only MVAPICH (and therefore also Intel MPI) jobs are encountering the OOM issue. == Paul Brunk, system administrator, Georgia Advanced Resource Computing Center …
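For anyone wanting to double-check placement themselves, a minimal sketch (the exact options used at this site are not shown in the thread) is to ask srun to report its binding and have each rank print its own CPU affinity:

    srun --ntasks=8 --cpu-bind=verbose,cores \
         bash -c 'echo "rank $SLURM_PROCID on $(hostname): $(taskset -cp $$)"'

With correct placement, each rank should report a distinct core (or core set) rather than every rank on a node sharing core 0.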

Re: [slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5

2022-02-10 Thread Paul Edmon
We also noticed the same thing with 21.08.5. In the 21.08 series SchedMD changed the way they handle cgroups to set the stage for cgroups v2 (see: https://slurm.schedmd.com/SLUG21/Roadmap.pdf). 21.08.5 introduced a bug fix which then caused mpirun to not pin properly (particularly for older …
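For context, the memory side of those kills is governed by the cgroup plugin settings; a minimal cgroup.conf sketch for cgroups v1 (illustrative only, not the poster's actual configuration) would be:

    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes

With ConstrainRAMSpace=yes, each step's memory cgroup limit is derived from the job's memory request, so a change in how tasks and steps are grouped into cgroups can change which step the OOM killer ends up targeting.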

Re: [slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5

2022-02-10 Thread Ward Poelmans
Hi Paul, On 10/02/2022 14:33, Paul Brunk wrote: Now we see a problem in which the OOM killer is in some cases predictably killing job steps that don't seem to deserve it. In some cases these are job scripts and input files which ran fine before our Slurm upgrade. More details follow, but the …
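One way to see whether the killed steps actually approached their limit is to compare the accounted peak RSS against the request, e.g. (a sketch; substitute the real job ID):

    sacct -j <jobid> --format=JobID,State,ReqMem,MaxRSS,Elapsed

If MaxRSS is well below ReqMem for a step that still shows OUT_OF_MEMORY, that would point at the cgroup accounting rather than genuine application memory growth (with the caveat that MaxRSS is sampled and can miss short spikes).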

[slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5

2022-02-10 Thread Paul Brunk
Hello all: We upgraded from 20.11.8 to 21.08.5 (CentOS 7.9, Slurm built without pmix support) recently. After that, we found that in many cases 'mpirun' was causing multi-node MPI jobs to have all MPI ranks within a node run on the same core. We've moved on to 'srun'. Now we see a problem in which …
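For illustration, the workaround described above amounts to launching the ranks with srun instead of mpirun inside the batch script; a minimal sketch (application name, task counts, memory request and --mpi value are placeholders, not taken from the original post):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4
    #SBATCH --mem-per-cpu=4G

    # previously: mpirun -np 8 ./my_app
    srun --mpi=pmi2 ./my_app

Since this Slurm build has no pmix support, the MPI library presumably has to bootstrap over PMI-1/PMI-2 for the srun route to work.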