Hi David,

On Tue, 16 Mar 2021 at 06:34, Chin,David <dw...@drexel.edu> wrote:

> Hi, Sean:
>
> Slurm version 20.02.6 (via Bright Cluster Manager)
>
>   ProctrackType=proctrack/cgroup
>   JobAcctGatherType=jobacct_gather/linux
>   JobAcctGatherParams=UsePss,NoShared
>
>
> I just skimmed https://bugs.schedmd.com/show_bug.cgi?id=5549 because this
> job appeared to have left two slurmstepd zombie processes running at
> 100%CPU each, and changed to:
>
>   ProctrackType=proctrack/cgroup
>   JobAcctGatherType=jobacct_gather/cgroup
>   JobAcctGatherParams=UsePss,NoShared,NoOverMemoryKill
>

You definitely want the NoOverMemoryKill option for JobAcctGatherParams. It
lets the cgroup kill the job when it goes over its memory limit, instead of
the Slurm accounting poll doing it.

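Once the user reruns the job, it's also worth pulling the accounting numbers
straight from sacct and comparing them with the request (the job ID below is
just a placeholder):

  # placeholder job ID - substitute the real one
  sacct -j 1234567 --format=JobID,State,ExitCode,ReqMem,MaxRSS,MaxVMSize

If the job still ends OUT_OF_MEMORY while MaxRSS and MaxVMSize sit well under
ReqMem, the likely explanation is that the cgroup caught a short-lived spike
(or kernel memory) that the periodic accounting sample missed.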

>
>
> Have asked the user to re-run the job, but that has not happened, yet.
>
> cgroup.conf:
>
>   CgroupMountpoint="/sys/fs/cgroup"
>   CgroupAutomount=yes
>   TaskAffinity=yes
>   ConstrainCores=yes
>   ConstrainRAMSpace=yes
>   ConstrainSwapSpace=no
>   ConstrainDevices=yes
>   ConstrainKmemSpace=yes
>   AllowedRamSpace=100.00
>   AllowedSwapSpace=0.00
>   MinKmemSpace=200
>   MaxKmemPercent=100.00
>   MemorySwappiness=100
>   MaxRAMPercent=100.00
>   MaxSwapPercent=100.00
>   MinRAMSpace=200
>

This looks good too. Our site does not constrain kmem space, but at least now
you'll be able to see why the cgroup killed the job: on the compute node, the
job's cgroup will show the memory in use at the time of the kill, so you can
check whether it was kmem related.

Sean


>
>
> Cheers,
>     Dave
>
> --
> David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
> dw...@drexel.edu                     215.571.4335 (o)
> For URCF support: urcf-supp...@drexel.edu
> https://proteusmaster.urcf.drexel.edu/urcfwiki
> github:prehensilecode
>
>
> ------------------------------
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> Sean Crosby <scro...@unimelb.edu.au>
> *Sent:* Monday, March 15, 2021 15:22
> *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
> *Subject:* Re: [slurm-users] [EXT] Job ended with OUT_OF_MEMORY even
> though MaxRSS and MaxVMSize are under the ReqMem value
>
> What are your Slurm settings - what's the values of
>
> ProctrackType
> JobAcctGatherType
> JobAcctGatherParams
>
> and what's the contents of cgroup.conf? Also, what version of Slurm are
> you using?
>
> Sean
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
>
