I would have thought partition QoS is the way to do this. We add partition QoS to our partition definitions, and implement quotas on usage as well.
PartitionName=physical Nodes=... Default=YES MaxTime=30-0 DefaultTime=0:10:0 State=DOWN QoS=physical TRESBillingWeights=CPU=1.0,Mem=4.0G

We then define the QoS "physical":

# sacctmgr show qos physical -p
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|
physical|0|00:00:00|||cluster|||1.000000|||||||||||cpu=750,mem=9585888M|||cpu=750,mem=9585888M||||

We implement the quotas using MaxTRESPerUser and MaxTRESPerAccount. It works really well for us.

If you need to override it for a particular group, you can create another QoS, set the OverPartQOS flag, and get those users to specify that QoS.

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Tue, 2 Mar 2021 at 08:24, Stack Korora <stackkor...@disroot.org> wrote:

> Greetings,
>
> We have different node classes that we've set up in different
> partitions. For example, we have our standard compute nodes in compute;
> our GPUs in a gpu partition; and jobs that need to run for months go
> into a long partition with a different set of machines.
>
> For each partition, we have a QOS to prevent any single user from
> dominating the resources (set at a max of 60% of resources; not my call
> - it's politics - I'm not going down that rabbit hole...).
>
> Thus, I've got something like this in my slurm.conf (abbreviating to
> save space; sorry if I trim too much).
>
> PartitionName=compute [snip] AllowQOS=compute Default=YES
> PartitionName=gpu [snip] AllowQOS=gpu Default=NO
> PartitionName=long [snip] AllowQOS=long Default=NO
>
> Then I have my QOS configured. And in my `sacctmgr dump cluster | grep
> DefaultQOS` I have "DefaultQOS=compute".
>
> All of that works exactly as expected.
>
> This makes it easy/nice for my users to just do something like:
> $ sbatch -n1 -N1 -p compute script.sh
>
> They don't have to specify the QOS for compute, and they like this.
>
> However, for the other partitions they have to do something like this:
> $ sbatch -n1 -N1 -p long --qos=long script.sh
>
> The users don't like this. (Though with scripts, I don't see the big
> deal in just adding a new line... but you know... users...)
>
> The request from the users is to make a default QOS for each partition,
> thus not needing to specify the QOS for the other partitions.
>
> Because the default is set in the cluster configuration, I'm not sure
> how to do this. And I'm not seeing anything in the documentation for a
> scenario like this.
>
> Question A:
> Anyone know how I can set a default QOS per partition?
>
> Question B:
> Chesterton's fence and all... Is there a better way to accomplish what
> we are attempting to do? I don't want a single QOS to limit across all
> partitions. I need a per-partition limit that restricts users to 60% of
> the resources in that partition.
>
> Thank you!
> ~Stack~
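
For anyone reproducing the setup described above from scratch, here is a rough (untested) sketch of the sacctmgr side, using the limit values shown earlier in this message; the override QoS name "physical-override" and the user name "someuser" are just placeholders:

# sacctmgr add qos physical
# sacctmgr modify qos name=physical set MaxTRESPerUser=cpu=750,mem=9585888M MaxTRESPerAccount=cpu=750,mem=9585888M

Point the partition at it in slurm.conf (QoS=physical, as in the PartitionName line above) and run "scontrol reconfigure". A plain submission then picks up the partition QoS limits with no --qos needed, which is the per-partition default behaviour being asked for:

$ sbatch -n1 -N1 -p physical script.sh

For a group that is allowed to exceed those limits, create the override QoS, set the flag, add it to their associations, and have them request it explicitly:

# sacctmgr add qos physical-override
# sacctmgr modify qos name=physical-override set Flags=OverPartQOS
# sacctmgr modify user name=someuser set qos+=physical-override

$ sbatch -n1 -N1 -p physical --qos=physical-override script.sh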