Yes, Michael! With this setup it does the job.
There are so many tuning possibilities in Slurm that I had missed this one.

Thank you very much.

Patrick

On 22/04/2025 at 16:30, Michael Gutteridge wrote:
Whoops, my mistake, sorry. Is this closer to what you want:

MaxTRESRunMinsPU
MaxTRESRunMinsPerUser
Maximum number of TRES minutes each user is able to use. This takes into consideration the time limit of running jobs. If the limit is reached, no new jobs are started until other jobs finish to allow time to free up.

https://slurm.schedmd.com/sacctmgr.html#OPT_MaxTRESRunMinsPU
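
For example, applied to the QoS from your setup, something like this should do it (just a sketch; "workflowlimit" and cpu=172800 are the name and value from your own messages):

sudo sacctmgr modify qos workflowlimit set MaxTRESRunMinsPerUser=cpu=172800

With that set, the time limits of a user's running jobs count against the 172800 CPU-minutes, and new submissions are held until running work finishes and frees up time.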

 - Michael

On Tue, Apr 22, 2025 at 1:35 AM Patrick Begou <patrick.be...@univ-grenoble-alpes.fr> wrote:

    Hi Michael,

    Thanks for your explanation. I understand that setting
    "MaxTRESMinsPerJob=cpu=172800" will allow (in my case; a quick
    arithmetic check follows the list):

    - a job on the full cluster for 6 hours
    - a job on half of the cluster for 12 hours
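
    A quick check of the arithmetic (taking the full cluster as 480
    cores, so half is 240; adjust to the real core count):

    echo $(( 480 * 6 * 60 ))    # assumed full cluster for 6 h  -> 172800 CPU-minutes
    echo $(( 240 * 12 * 60 ))   # assumed half cluster for 12 h -> 172800 CPU-minutes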

    But if I do not want the same user to run, at the same time, two
    jobs on half of the cluster for 12 hours (and so fill the cluster
    for a long time), how can I limit his running jobs to 172800
    CPU-minutes? I was looking for something like "MaxTRESMinsPerUser"
    but cannot find such a limit.

    Patrick



    On 18/04/2025 at 17:17, Michael Gutteridge wrote:
    Hi

    I think you want one of the "MaxTRESMins*" options:

    MaxTRESMins=TRES=<minutes>[,TRES=<minutes>,...]
    MaxTRESMinsPJ=TRES=<minutes>[,TRES=<minutes>,...]
    MaxTRESMinsPerJob=TRES=<minutes>[,TRES=<minutes>,...]
    Maximum number of TRES minutes each job is able to use in this
    association. This is overridden if set directly on a user.
    Default is the cluster's limit. To clear a previously set value
    use the modify command with a new value of -1 for each TRES id.

     - sacctmgr(1)

    The "MaxCPUs" is a limit on the number of CPUs the association
    can use.
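
    So on your existing "workflowlimit" QoS, something along these lines
    should give you the per-job CPU-minutes ceiling (a sketch, reusing
    the name and the 172800 value from your message; clearing the old
    cpu limit with -1 follows the man page note above):

    # drop the plain CPU-count limit, then set a CPU-minutes limit instead
    sudo sacctmgr modify qos workflowlimit set MaxTRESPerJob=cpu=-1
    sudo sacctmgr modify qos workflowlimit set MaxTRESMinsPerJob=cpu=172800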

     -- Michael


    On Fri, Apr 18, 2025 at 8:01 AM Patrick Begou via slurm-users
    <slurm-users@lists.schedmd.com> wrote:

        Hi all,

        I'm trying to set up a QoS on a small 5-node cluster running
        Slurm 24.05.7. My goal is to limit resources with a (time x
        number of cores) strategy, to avoid one large job requesting
        all the resources for too long. I've read
        https://slurm.schedmd.com/qos.html and some discussions, but
        my setup is still not working.

        I think I need to set these parameters:
        MaxCPUsPerJob=172800
        MaxWallDurationPerJob=48:00:00
        Flags=DenyOnLimit,OverPartQOS

        in order to get:
        - 12 h max for 240 cores => (12*240*60 = 172800 CPU-minutes)
        - no job can exceed 2 days
        - do not accept jobs outside these limits.

        What I've done:

        1) Create the QoS:
        sudo sacctmgr add qos workflowlimit \
              MaxWallDurationPerJob=48:00:00 \
              MaxCPUsPerJob=172800 \
              Flags=DenyOnLimit,OverPartQOS


        2) Check
        sacctmgr show qos Name=workflowlimit format=Name%16,MaxTRES,MaxWall
                        Name       MaxTRES     MaxWall
            ---------------- ------------- -----------
               workflowlimit    cpu=172800  2-00:00:00

        3) Set the QoS for the account "most", which is the default
        account for the users:
        sudo sacctmgr modify account name=most set qos=workflowlimit

        4) Check
        $ sacctmgr show assoc format=account,cluster,user,qos
            Account    Cluster       User                  QOS
        ---------- ---------- ---------- --------------------
               root     osorno                          normal
               root     osorno       root               normal
               legi     osorno                          normal
               most     osorno                   workflowlimit
               most     osorno      begou        workflowlimit

        5) Modify slurm.conf with:
             AccountingStorageEnforce=limits,qos
        and propagate it to the 5 nodes and the front end (done via Ansible)

        6) Check
        clush -b -w osorno-fe,osorno,osorno-0-[0-4] 'grep AccountingStorageEnforce /etc/slurm/slurm.conf'
        ---------------
        osorno,osorno-0-[0-4],osorno-fe (7)
        ---------------
        AccountingStorageEnforce=limits,qos

        7) Restart slurmd on all the compute nodes, and slurmctld +
        slurmdbd on the management node.

        But I can still request 400 cores for 24 hours:
        [begou@osorno ~]$ srun -n 400 -t 24:0:0 --pty bash
        bash-5.1$ squeue
            JOBID PARTITION NAME USER  ST TIME START_TIME          TIME_LIMIT CPUS NODELIST(REASON)
              147 genoa     bash begou R  0:03 2025-04-18T16:52:11 1-00:00:00  400 osorno-0-[0-4]

        So I must have missed something?

        My partition (I have only one) in slurm.conf is:
        PartitionName=genoa State=UP Default=YES MaxTime=48:00:00 DefaultTime=24:00:00 Shared=YES OverSubscribe=NO Nodes=osorno-0-[0-4]

        Thanks

        Patrick



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
