Hi Thomas. With that partition configuration, I suspect jobs are going through the partition without the 'normal' QoS, which is what restricts the number of GPUs per user.

You may find that reconfiguring the partition to have a QoS of 'normal' results in the GPU limit being applied as intended. This is set in the partition configuration in slurm.conf.
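For example, a sketch based on the scontrol output below (node list abbreviated; keep whatever else is already on your PartitionName line and just add the QoS field):

PartitionName=PART1 Nodes=node1,node2,node3,node4,... AllowGroups=trace_unix_group Default=NO State=UP QoS=normal

After an 'scontrol reconfigure' (or a slurmctld restart), 'scontrol show partition PART1' should report QoS=normal rather than QoS=N/A, and the MaxTRESPerUser=gres/gpu=2 you have already set on the 'normal' QoS should then be enforced for jobs in that partition. Note this assumes AccountingStorageEnforce in slurm.conf includes 'limits'; without that, QoS limits are not enforced at all.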
Killian

On Thu, 7 May 2020 at 18:25, Theis, Thomas <thomas.th...@teledyne.com> wrote:

Here are the outputs.

sacctmgr show qos -p

Name|Priority|GraceTime|Preempt|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GtPA|MinTRES|
normal|10000|00:00:00||cluster|||1.000000|gres/gpu=2||||||||||gres/gpu=2|||||||
now|1000000|00:00:00||cluster|||1.000000||||||||||||||||||
high|100000|00:00:00||cluster|||1.000000||||||||||||||||||

scontrol show part

PartitionName=PART1
AllowGroups=trace_unix_group AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=node1,node2,node3,node4,…. PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=236 TotalNodes=11 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

Thomas Theis

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Wednesday, May 6, 2020 6:22 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

Do you have other limits set? QoS is hierarchical, and a partition QoS in particular can override other QoS limits.

What's the output of

sacctmgr show qos -p

and

scontrol show part

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia

On Wed, 6 May 2020 at 23:44, Theis, Thomas <thomas.th...@teledyne.com> wrote:

I still have the same issue after updating the user and the QoS. The command I am using is:

sacctmgr modify qos normal set MaxTRESPerUser=gres/gpu=2

I restarted the services; unfortunately I am still able to saturate the cluster with jobs.

We have a cluster of 10 nodes, each with 4 GPUs, for a total of 40 GPUs. Each node is identical in software, OS, Slurm, etc. I am trying to limit each user to only 2 of the 40 GPUs across the entire cluster or partition: an intended bottleneck so that no one can saturate the cluster.

I.e., the desired outcome would be: person A submits 100 jobs, 2 run and 98 are pending, leaving 38 GPUs idle. Once the 2 running jobs finish, 2 more run and 96 are pending, with 38 GPUs still idle.

Thomas Theis

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Tuesday, May 5, 2020 6:48 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

Hi Thomas,

That value should be

sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4
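A quick way to confirm the limit has landed on the QoS (untested here; the format field name is taken from the sacctmgr man page) would be something like:

sacctmgr show qos gpujobs format=Name,MaxTRESPU

which should list gres/gpu=4 under the MaxTRESPU column.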
Sean

On Wed, 6 May 2020 at 04:53, Theis, Thomas <thomas.th...@teledyne.com> wrote:

Hey Killian,

I tried to limit the number of GPUs a user can run on at a time by adding MaxTRESPerUser = gres:gpu4 to both the user and the QoS. I restarted the Slurm control daemon and unfortunately I am still able to run on all the GPUs in the partition. Any other ideas?

Thomas Theis

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Killian Murphy
Sent: Thursday, April 23, 2020 1:33 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Limit the number of GPUS per user per partition

Hi Thomas.

We limit the maximum number of GPUs a user can have allocated in a partition through the MaxTRESPerUser field of a QoS for GPU jobs, which is set as the partition QoS on our GPU partition. I.e. we have a QoS `gpujobs` that sets MaxTRESPerUser=gres/gpu=4 to limit the total number of allocated GPUs to 4 per user, and we set the GPU partition's QoS to the `gpujobs` QoS.

There is a section on the 'Resource Limits' page of the Slurm documentation entitled 'QOS specific limits supported' (https://slurm.schedmd.com/resource_limits.html) that details some care needed when using this kind of limit with typed GRES. Although it seems like you are trying to do something with generic GRES, it's worth a read!
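In terms of commands, that setup is roughly the following (the partition name 'gpu' is just a placeholder for your GPU partition, and the slurm.conf change needs an 'scontrol reconfigure' afterwards):

sacctmgr add qos gpujobs
sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4

and then in slurm.conf, on the GPU partition line:

PartitionName=gpu ... QoS=gpujobs

The key point is that the limit lives on the QoS, and attaching that QoS to the partition is what makes Slurm apply it to every job running in the partition.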
Killian

On Thu, 23 Apr 2020 at 18:19, Theis, Thomas <thomas.th...@teledyne.com> wrote:

Hi everyone,

This is my first message. I am trying to find a good way, or several ways, to limit the usage of jobs per node, or the use of GPUs per node, without blocking a user from submitting jobs.

Example: we have 10 nodes, each with 4 GPUs, in a partition. We allow a team of 6 people to submit jobs to any or all of the nodes, one job per GPU, so we can hold a total of 40 jobs concurrently in the partition. At the moment each user usually submits 50-100 jobs at once, taking up all the GPUs, and all other users have to wait in pending.

What I am trying to set up is to allow all users to submit as many jobs as they wish, but only run on 1 of the 4 GPUs per node, or on some number of the 40 GPUs across the entire partition. We are using Slurm 18.08.3.

This is roughly our Slurm script:

#SBATCH --job-name=Name       # Job name
#SBATCH --mem=5gb             # Job memory request
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=PART1
#SBATCH --time=200:00:00      # Time limit hrs:min:sec
#SBATCH --output=job_%j.log   # Standard output and error log
#SBATCH --nodes=1
#SBATCH --qos=high

srun -n1 --gres=gpu:1 --exclusive --export=ALL bash -c "NV_GPU=$SLURM_JOB_GPUS nvidia-docker run --rm -e SLURM_JOB_ID=$SLURM_JOB_ID -e SLURM_OUTPUT=$SLURM_OUTPUT --name $SLURM_JOB_ID do_job.sh"

Thomas Theis

--
Killian Murphy
Research Software Engineer

Wolfson Atmospheric Chemistry Laboratories
University of York
Heslington
York
YO10 5DD
+44 (0)1904 32 4753