Hi Michael,
Thanks for your explanation. I understand that setting
"MaxTRESMinsPerJob=cpu=172800" will allow (in my case)
- a job on the full cluster for 6h
- a job on half of the cluster for 12 hours
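(Both cases amount to the same budget: e.g. 240 cores * 12 h * 60 = 172800 CPU-minutes.)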
But if I do not want the same user to run, at the same time, two jobs on
half of the cluster for 12 hours each (and so fill the cluster for a long time),
how can I limit his running jobs to 172800 CPU-minutes in total?
I was looking for something like "MaxTRESMinsPerUser" but could not find
such a limit.
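What I had in mind is something along these lines (a hypothetical command, since this option does not seem to exist):

sudo sacctmgr modify qos workflowlimit set MaxTRESMinsPerUser=cpu=172800   # hypothetical option name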
Patrick
On 18/04/2025 at 17:17, Michael Gutteridge wrote:
Hi
I think you want one of the "MaxTRESMins*" options:
MaxTRESMins=TRES=<minutes>[,TRES=<minutes>,...]
MaxTRESMinsPJ=TRES=<minutes>[,TRES=<minutes>,...]
MaxTRESMinsPerJob=TRES=<minutes>[,TRES=<minutes>,...]
Maximum number of TRES minutes each job is able to use in this
association. This is overridden if set directly on a user. Default is
the cluster's limit. To clear a previously set value use the modify
command with a new value of -1 for each TRES id.
- sacctmgr(1)
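For instance, applied to the QoS named in your message it would look something like this (a sketch, I haven't tested it on your setup):

sacctmgr modify qos workflowlimit set MaxTRESMinsPerJob=cpu=172800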
The "MaxCPUs" is a limit on the number of CPUs the association can use.
-- Michael
On Fri, Apr 18, 2025 at 8:01 AM Patrick Begou via slurm-users
<slurm-users@lists.schedmd.com> wrote:
Hi all,
I'm trying to set up a QoS on a small 5-node cluster running Slurm
24.05.7. My goal is to limit the resources with a (time x number of cores)
strategy, to avoid one large job requesting all the resources for too
long. I've read https://slurm.schedmd.com/qos.html and some
discussions, but my setup is still not working.
I think I need to set the following:
MaxCPUsPerJob=172800
MaxWallDurationPerJob=48:00:00
Flags=DenyOnLimit,OverPartQOS
for:
- 12 h max for 240 cores => (12*240*60 = 172800 min)
- no job can exceed 2 days
- do not accept jobs out of these limits (a worked example follows this list)
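With these numbers, the test shown further down (400 cores for 24 hours) should be rejected: 400*24*60 = 576000 CPU-minutes, well above the 172800 limit.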
What I've done:
1) create the QoS:
sudo sacctmgr add qos workflowlimit \
MaxWallDurationPerJob=48:00:00 \
MaxCPUsPerJob=172800 \
Flags=DenyOnLimit,OverPartQOS
2) Check
sacctmgr show qos Name=workflowlimit format=Name%16,MaxTRES,MaxWall
Name MaxTRES MaxWall
---------------- ------------- -----------
workflowlimit cpu=172800 2-00:00:00
3) Set the QoS for the account "most", which is the default account for the users:
sudo sacctmgr modify account name=most set qos=workflowlimit
4) Check
$ sacctmgr show assoc format=account,cluster,user,qos
Account Cluster User QOS
---------- ---------- ---------- --------------------
root osorno normal
root osorno root normal
legi osorno normal
most osorno workflowlimit
most osorno begou workflowlimit
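(I only show the QOS column here; I assume the associations' Def QOS column could also be listed with something like "sacctmgr show assoc format=account,cluster,user,qos,defaultqos", but I have not looked at it.)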
5) Modify slurm.conf with:
AccountingStorageEnforce=limits,qos
and propagate it to the 5 nodes and the front end (done via Ansible)
6) Check
clush -b -w osorno-fe,osorno,osorno-0-[0-4] 'grep AccountingStorageEnforce /etc/slurm/slurm.conf'
---------------
osorno,osorno-0-[0-4],osorno-fe (7)
---------------
AccountingStorageEnforce=limits,qos
7) restart slurmd on all the compute nodes and slurmctld + slurmdbd on the management node.
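As an extra check (I have not pasted the output here), the running controller's view of the setting can be confirmed with something like:

scontrol show config | grep AccountingStorageEnforce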
But I can still request 400 cores for 24 hours:
[begou@osorno ~]$ srun -n 400 -t 24:0:0 --pty bash
bash-5.1$ squeue
JOBID PARTITION NAME USER  ST TIME START_TIME          TIME_LIMIT CPUS NODELIST(REASON)
  147     genoa bash begou  R 0:03 2025-04-18T16:52:11 1-00:00:00  400 osorno-0-[0-4]
So I must have missed something?
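(One thing I have not checked yet is which QoS the job actually ran under; I assume something like "scontrol show job 147 | grep -i qos" would show it.)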
My partition (I have only one) in slurm.conf is:
PartitionName=genoa State=UP Default=YES MaxTime=48:00:00 DefaultTime=24:00:00 Shared=YES OverSubscribe=NO Nodes=osorno-0-[0-4]
Thanks
Patrick
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com