Thank you Carsten. I'll take a closer look at the QOS limit approach.
If I'm understanding the documentation correctly, partition limits
(non-QOS) are set via the slurm.conf file. Although there are options
for limiting the max number of nodes for a person and the max CPUs per
node, there isn't an option within slurm.conf to limit the total number
of CPUs that someone can use, so my original approach will not work.
The QOS option you mention seems to be the way to do it in order to set
a default limit for everyone on the partition.
The only other approach I can see would be to set an association limit
for every account individually.
Thank you,
-Dj
On 9/23/21 07:18, Carsten Beyer wrote:
Hi Dj,
the solution could be to use two QOS. We use something similar to
restrict usage of GPU nodes (MaxTresPU=node=2). The examples below are
from our test cluster.
1) Create a QOS with e.g. MaxTresPU=cpu=200 and assign it to your
partition, e.g.
[root@bta0 ~]# sacctmgr -s show qos maxcpu format=Name,MaxTRESPU
      Name     MaxTRESPU
---------- -------------
    maxcpu        cpu=10
[root@bta0 ~]#
[root@bta0 ~]# scontrol show part maxtresputest
PartitionName=maxtresputest
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=maxcpu
If a user submits jobs requesting more CPUs than the limit allows, the
(new) jobs stay pending with the reason 'QOSMaxCpuPerUserLimit' in squeue.
kxxxxxx@btlogin1% squeue
 JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
125316 maxtrespu maxsubmi  kxxxxxx PD  0:00     1 (QOSMaxCpuPerUserLimit)
125317 maxtrespu maxsubmi  kxxxxxx PD  0:00     1 (QOSMaxCpuPerUserLimit)
125305 maxtrespu maxsubmi  kxxxxxx  R  0:45     1 btc30
125306 maxtrespu maxsubmi  kxxxxxx  R  0:45     1 btc30
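For reference, a minimal sketch of how the QOS in step 1 might be created and attached to the partition. The QOS name, partition name and limit are the ones from the text above; the run() helper and the setup_partition_qos function are only illustrative and not part of Slurm. By default the script just prints the commands so it can be reviewed before anything is changed.

```shell
#!/bin/sh
# Sketch only: run() and setup_partition_qos() are hypothetical helpers.

# Print each command by default; set SLURM_APPLY=1 to actually execute it.
run() {
    if [ "${SLURM_APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# usage: setup_partition_qos <qos> <partition> <max cpus per user>
setup_partition_qos() {
    qos=$1
    part=$2
    limit=$3
    run sacctmgr -i add qos "$qos"
    run sacctmgr -i modify qos "$qos" set "MaxTRESPerUser=cpu=$limit"
    # A change made with scontrol is lost when slurm.conf is re-read; for a
    # permanent setting add QOS=<qos> to the PartitionName line in slurm.conf.
    run scontrol update "PartitionName=$part" "QoS=$qos"
}

setup_partition_qos maxcpu maxtresputest 200
```

Note that QOS limits are only enforced if AccountingStorageEnforce in slurm.conf includes limits.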
2) Create a second QOS with Flags=DenyOnLimit,OverPartQoS and
MaxTresPU=cpu=400. Assign it to a user who should be allowed to exceed
the limit of 200 CPUs; that user will then be limited to 400 CPUs. The
user has to request this QOS when submitting new jobs, e.g.
[root@bta0 ~]# sacctmgr -s show qos overpart format=Name,Flags%30,MaxTRESPU
      Name                          Flags     MaxTRESPU
---------- ------------------------------ -------------
  overpart        DenyOnLimit,OverPartQOS        cpu=40
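A rough sketch of step 2, under the same assumptions as before: the QOS name and the 400-CPU limit come from the text above, while the run() helper, the setup_override_qos function, the user name and job.sh are placeholders for illustration. Commands are printed rather than executed unless SLURM_APPLY=1 is set.

```shell
#!/bin/sh
# Sketch only: run() and setup_override_qos() are hypothetical helpers.

# Print each command by default; set SLURM_APPLY=1 to actually execute it.
run() {
    if [ "${SLURM_APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# usage: setup_override_qos <qos> <user> <max cpus per user>
setup_override_qos() {
    qos=$1
    user=$2
    limit=$3
    run sacctmgr -i add qos "$qos"
    run sacctmgr -i modify qos "$qos" set Flags=DenyOnLimit,OverPartQOS "MaxTRESPerUser=cpu=$limit"
    # Add the QOS to the user's allowed list without replacing existing ones.
    run sacctmgr -i modify user where "name=$user" set "qos+=$qos"
}

setup_override_qos overpart kxxxxxx 400

# The user then requests the QOS explicitly at submit time
# (job.sh is a placeholder job script).
run sbatch --partition=maxtresputest --qos=overpart job.sh
```

With OverPartQOS set, the job QOS limit takes precedence over the partition QOS limit, which is why the user has to pass --qos explicitly to get the higher limit.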
Cheers,
Carsten