Hi Janne, 
On Fri, 2019-01-11 at 10:37 +0200, Janne Blomqvist wrote:
> On 11/01/2019 08.29, Sergey Koposov wrote:
> > What is your memory limit configuration in slurm? Anyway, a few things to 
> > check:
I guess these are the most relevant (uncommented) params I could see in the 
slurm.conf:

SelectTypeParameters=CR_Core_Memory
JobAcctGatherType=jobacct_gather/linux
TaskPlugin=task/affinity
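(Those are copied straight from slurm.conf; I assume something like
  scontrol show config | grep -iE 'SelectType|TaskPlugin|JobAcctGather'
would show the effective values as well.)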
> - Make sure you're not limiting RLIMIT_AS in any way (e.g. run "ulimit -v" in 
> your batch script, ensure it's unlimited. In the slurm config, ensure
> VSizeFactor=0).
No, it is clearly not a ulimit issue, as I'm essentially using my PBS script 
that worked fine before. Plus I'm seeing errors like these:
slurmstepd: error: Job 134 exceeded memory limit (146371328 > 131072000), being killed
slurmstepd: error: *** JOB 134 ON compute-1-26 CANCELLED AT 2019-01-11T03:22:03 ***
The VSizeFactor option is commented out.
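Just to rule it out completely, next run I'll print the limit at the top of the 
batch script, something like this (the --mem value and the srun line are only 
placeholders for what the real script does):

  #!/bin/bash
  #SBATCH --mem=128000      # placeholder for whatever the real job requests
  # sanity check: this should print "unlimited"
  ulimit -v
  srun ./my_program         # placeholder for the actual job step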

> - Are you using task/cgroup for limiting memory? In that case the problem 
> might be that cgroup memory limits work with RSS, and as you're running 
> multiple
> processes the shared mmap'ed file will be counted multiple times. There's no 
> really good way around this, but with, say, something like
> 
> ConstrainRAMSpace=no
> ConstrainSwapSpace=yes
> AllowedRAMSpace=100
> AllowedSwapSpace=1600
> you'll get a setup where the cgroup soft limit will be set to the amount your 
> job allocates, but the hard limit (where the job will be killed) will be set 
> to
> 1600% of that.
> - If you're using cgroups for memory limits, you should also set 
> JobAcctGatherParams=NoOverMemoryKill
> - If you're NOT using cgroups for memory limits, try setting 
> JobAcctGatherParams=UsePSS which should avoid counting the shared mappings 
> multiple times.
(I'm not sure if cgroups are used here currently.) But thanks for the suggestions. 
We'll try those and report back.
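For the record, the plan based on your suggestions (assuming the parameter names 
below are the right ones for our setup):

  # first check whether the cgroup plugins are in use at all
  scontrol show config | grep -iE 'TaskPlugin|ProctrackType|JobAcctGather'

  # if task/cgroup is in use: in cgroup.conf (your numbers)
  ConstrainRAMSpace=no
  ConstrainSwapSpace=yes
  AllowedRAMSpace=100
  AllowedSwapSpace=1600
  # and in slurm.conf
  JobAcctGatherParams=NoOverMemoryKill

  # if cgroups are NOT in use: in slurm.conf instead
  JobAcctGatherParams=UsePSS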

Regards, 
         Sergey
