[slurm-users] Re: Setting QoS with slurm 24.05.7

Michael Gutteridge via slurm-users Tue, 22 Apr 2025 07:31:39 -0700

Whoops, my mistake, sorry. Is this closer to what you want?

MaxTRESRunMinsPU
MaxTRESRunMinsPerUser
Maximum number of TRES minutes each user is able to use. This takes into
consideration the time limit of running jobs. If the limit is reached, no
new jobs are started until other jobs finish to allow time to free up.
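
To put numbers on that for the case in this thread, here is a rough sketch in
plain shell arithmetic. It assumes each running job is charged cores x time
limit in minutes (a simplification of how Slurm tracks running TRES minutes),
and it uses the 240-core, 12-hour job and the 172800 figure from the original
post. `cpu_minutes` is just a helper name made up for this example.

```shell
# cpu_minutes CORES HOURS -> CPU-minutes charged for a job's full time limit
# (hypothetical helper, not part of Slurm)
cpu_minutes() {
    echo $(( $1 * $2 * 60 ))
}

limit=172800                    # assumed value for MaxTRESRunMinsPerUser=cpu=...
one_job=$(cpu_minutes 240 12)   # 240 cores for 12 h, as in the original post
two_jobs=$(( 2 * one_job ))     # the same user running two such jobs at once

echo "one job:  $one_job cpu-minutes"
echo "two jobs: $two_jobs cpu-minutes"
if [ "$two_jobs" -gt "$limit" ]; then
    echo "second job would stay pending until running time frees up"
fi
```

Under these assumptions the first job fits the limit exactly and a second
identical job from the same user would have to wait.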


https://slurm.schedmd.com/sacctmgr.html#OPT_MaxTRESRunMinsPU
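
If that is the right limit, setting it on the existing QoS might look like the
following. This is an untested sketch: it assumes the QoS is still called
workflowlimit and that 172800 cpu-minutes is the value you want. It only prints
the command, so nothing changes until you run the printed line yourself (with
sudo, as in the original post).

```shell
# Dry run: build and print the sacctmgr command instead of executing it.
qos=workflowlimit             # QoS name taken from the original post
limit=$(( 240 * 12 * 60 ))    # 172800 cpu-minutes, the figure from the original post

cmd="sacctmgr modify qos $qos set MaxTRESRunMinsPerUser=cpu=$limit"
echo "$cmd"
```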

 - Michael

On Tue, Apr 22, 2025 at 1:35 AM Patrick Begou <
patrick.be...@univ-grenoble-alpes.fr> wrote:

> Hi Michael,
>
> thanks for your explanation. I understand that setting
> "MaxTRESMinsPerJob=cpu=172800"  will allow (in my case)
>
> -  a job on the full cluster for 6h
> -  a job on half of the cluster for 12 hours
>
> But if I do not want the same user to run 2 jobs at the same time on half
> of the cluster for 12 hours (and fill the cluster for a long time), how can
> I limit his running jobs to 172800 cpu-minutes?
> I was looking for something like "MaxTRESMinsPerUser" but could not find
> such a limit.
>
> Patrick
>
>
>
> On 18/04/2025 at 17:17, Michael Gutteridge wrote:
>
> Hi
>
> I think you want one of the "MaxTRESMins*" options:
>
> MaxTRESMins=TRES=<minutes>[,TRES=<minutes>,...]
> MaxTRESMinsPJ=TRES=<minutes>[,TRES=<minutes>,...]
> MaxTRESMinsPerJob=TRES=<minutes>[,TRES=<minutes>,...]
> Maximum number of TRES minutes each job is able to use in this
> association. This is overridden if set directly on a user. Default is the
> cluster's limit. To clear a previously set value use the modify command
> with a new value of -1 for each TRES id.
>
>    - sacctmgr(1)
>
> "MaxCPUs" is a limit on the number of CPUs the association can use, not on
> CPU time.
>
>  -- Michael
>
>
> On Fri, Apr 18, 2025 at 8:01 AM Patrick Begou via slurm-users <
> slurm-users@lists.schedmd.com> wrote:
>
>> Hi all,
>>
>> I'm trying to set up a QoS on a small 5-node cluster running slurm
>> 24.05.7. My goal is to limit resources with a (time x number of cores)
>> strategy, so that one large job cannot hold all the resources for too
>> long. I've read https://slurm.schedmd.com/qos.html and some
>> discussions, but my setup is still not working.
>>
>> I think I need to set these options:
>> MaxCPUsPerJob=172800
>> MaxWallDurationPerJob=48:00:00
>> Flags=DenyOnLimit,OverPartQOS
>>
>> for:
>> 12h max for 240 cores => (12*240*60 = 172800 cpu-minutes)
>> no job can exceed 2 days
>> do not accept jobs out of these limits.
>>
>> What I've done:
>>
>> 1) create the QoS:
>> sudo sacctmgr add qos workflowlimit \
>>       MaxWallDurationPerJob=48:00:00 \
>>       MaxCPUsPerJob=172800 \
>>       Flags=DenyOnLimit,OverPartQOS
>>
>>
>> 2) Check
>> sacctmgr show qos Name=workflowlimit format=Name%16,MaxTRES,MaxWall
>>                 Name       MaxTRES     MaxWall
>>     ---------------- ------------- -----------
>>        workflowlimit    cpu=172800  2-00:00:00
>>
>> 3) Set the QoS for the account "most" which is the default account for
>> the users:
>> sudo sacctmgr modify account name=most set qos=workflowlimit
>>
>> 4) Check
>> $ sacctmgr show assoc format=account,cluster,user,qos
>>     Account    Cluster       User                  QOS
>> ---------- ---------- ---------- --------------------
>>        root     osorno                          normal
>>        root     osorno       root               normal
>>        legi     osorno                          normal
>>        most     osorno                   workflowlimit
>>        most     osorno      begou        workflowlimit
>>
>> 5) Modify slurm.conf with:
>>      AccountingStorageEnforce=limits,qos
>> and propagate on the 5 nodes and the front end (done via Ansible)
>>
>> 6) Check
>> clush -b -w osorno-fe,osorno,osorno-0-[0-4] 'grep AccountingStorageEnforce /etc/slurm/slurm.conf'
>> ---------------
>> osorno,osorno-0-[0-4],osorno-fe (7)
>> ---------------
>> AccountingStorageEnforce=limits,qos
>>
>> 7) restart slurmd on all the compute nodes and slurmctld + slurmdbd on
>> the management node.
>>
>> But I can still request 400 cores for 24 hours:
>> [begou@osorno ~]$ srun -n 400 -t 24:0:0 --pty bash
>> bash-5.1$ squeue
>>    JOBID PARTITION NAME  USER ST TIME START_TIME          TIME_LIMIT CPUS NODELIST(REASON)
>>      147 genoa     bash begou  R 0:03 2025-04-18T16:52:11 1-00:00:00  400 osorno-0-[0-4]
>>
>> So I must have missed something?
>>
>> My partition (I've only one) in slurm.conf is:
>> PartitionName=genoa State=UP Default=YES MaxTime=48:00:00 DefaultTime=24:00:00 Shared=YES OverSubscribe=NO Nodes=osorno-0-[0-4]
>>
>> Thanks
>>
>> Patrick
>>
>>
>> --
>> slurm-users mailing list -- slurm-users@lists.schedmd.com
>> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>>
>
>

