Hi all,

I'm trying to set up a QoS on a small 5-node cluster running Slurm 24.05.7. My goal is to limit resources with a (time x number of cores) strategy, to avoid one large job grabbing all the resources for too long. I've read https://slurm.schedmd.com/qos.html and some discussions, but my setup is still not working.

I think I need to set these parameters:
MaxCPUsPerJob=172800
MaxWallDurationPerJob=48:00:00
Flags=DenyOnLimit,OverPartQOS

for:
12 h max on 240 cores => 12*240*60 = 172800 CPU-minutes
no job can exceed 2 days
do not accept jobs outside these limits (example of what I mean just below).
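
To make the intent concrete (only my expectation, nothing I have verified yet): a request such as

srun -n 240 -t 12:00:00 --pty bash

should still be accepted (240*12*60 = 172800 CPU-minutes, right at the budget), while

srun -n 400 -t 24:00:00 --pty bash

should be refused at submission, since 400*24*60 = 576000 CPU-minutes exceeds the budget.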

What I've done:

1) create the QoS:
sudo sacctmgr add qos workflowlimit \
     MaxWallDurationPerJob=48:00:00 \
     MaxCPUsPerJob=172800 \
     Flags=DenyOnLimit,OverPartQOS


2) Check
sacctmgr show qos Name=workflowlimit format=Name%16,MaxTRES,MaxWall
               Name       MaxTRES     MaxWall
   ---------------- ------------- -----------
      workflowlimit    cpu=172800  2-00:00:00
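
(I suppose I could also add Flags to the format string to double-check that DenyOnLimit and OverPartQOS were really stored, e.g.

sacctmgr show qos Name=workflowlimit format=Name%16,Flags%40,MaxTRES,MaxWall

but I have not pasted that output here.)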

3) Set the QoS for the account "most" which is the default account for the users:
sudo sacctmgr modify account name=most set qos=workflowlimit

4) Check
$ sacctmgr show assoc format=account,cluster,user,qos
   Account    Cluster       User                  QOS
---------- ---------- ---------- --------------------
      root     osorno                          normal
      root     osorno       root               normal
      legi     osorno                          normal
      most     osorno                   workflowlimit
      most     osorno      begou        workflowlimit
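
(Side note: I have not touched the default QoS of the associations, so I assume it is still "normal"; if useful, I guess something like

sacctmgr show assoc format=account,user,qos,defaultqos

would show it.)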

5) Modify slurm.conf with:
    AccountingStorageEnforce=limits,qos
and propagate it to the 5 nodes and the front end (done via Ansible)

6) Check
clush -b -w osorno-fe,osorno,osorno-0-[0-4] 'grep AccountingStorageEnforce /etc/slurm/slurm.conf'
---------------
osorno,osorno-0-[0-4],osorno-fe (7)
---------------
AccountingStorageEnforce=limits,qos

7) restart slurmd on all the compute nodes and slurmctld + slurmdbd on the management node.
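
(To be sure the running slurmctld really picked up the change, I suppose I could also check with

scontrol show config | grep AccountingStorageEnforce

on the management node, but I assume the restart was enough.)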

But I can still request 400 cores for 24 hours:
[begou@osorno ~]$ srun -n 400 -t 24:0:0 --pty bash
bash-5.1$ squeue
  JOBID        PARTITION               NAME       USER ST TIME          START_TIME TIME_LIMIT CPUS NODELIST(REASON)
    147            genoa               bash      begou  R 0:03 2025-04-18T16:52:11 1-00:00:00  400 osorno-0-[0-4]
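
(I have not yet checked which QoS the job was actually submitted under; I guess something like

squeue -j 147 -o '%i %q'

or

sacct -j 147 --format=JobID,QOS

would tell, if that matters.)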

So I must have missed something?

My partition (I have only one) in slurm.conf is:
PartitionName=genoa  State=UP Default=YES MaxTime=48:00:00 DefaultTime=24:00:00 Shared=YES OverSubscribe=NO Nodes=osorno-0-[0-4]
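
(I am also wondering whether the QoS has to be attached to the partition itself, i.e. something like

PartitionName=genoa State=UP Default=YES MaxTime=48:00:00 DefaultTime=24:00:00 Shared=YES OverSubscribe=NO QOS=workflowlimit Nodes=osorno-0-[0-4]

since the OverPartQOS flag seems related to a partition QoS, but I have not tried that.)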

Thanks

Patrick

