Hi Davide,

Davide DelVento via slurm-users
<slurm-users@lists.schedmd.com> writes:

> In the institution where I work, so far we have managed to live
> without mandatory wallclock limits (a policy decided well before I
> joined the organization), and that has been possible because the
> cluster was not very much utilized.
>
> Now that is changing, with more jobs being submitted and those being
> larger ones. As such I would like to introduce wallclock limits to
> allow slurm to be more efficient in scheduling jobs, including with
> backfill.
>
> My concern is that this user base is not used to it and therefore I
> want to make it easier for them, and avoid common complaints. I
> anticipate one of them would be "my job was cancelled even though
> there were enough nodes idle and no other job in line after mine"
> (since the cluster utilization is increasing, but not yet always full
> like it has been at most other places I know).
>
> So my question is: is it possible to implement "soft" wallclock limits
> in slurm, namely ones which would not be enforced unless necessary to
> run more jobs? In other words, is it possible to change the
> pre-emptability of a job only after some time has passed? I can think
> of some ways to hack this functionality myself with some cron or at
> jobs, and that might be easy enough to do, but I am not sure I can
> make it robust enough to cover all situations, so I'm looking for
> something either slurm-native or (if external solution) field-tested
> by someone else already, so that at least the worst kinks have been
> already ironed out.
>
> Thanks in advance for any suggestions you may provide!

We just have a default wallclock limit of 14 days, but we also have
QOSs with shorter wallclock limits and higher priorities, albeit for
fewer jobs and resources:

$ sqos
      Name   Priority     MaxWall MaxJobs MaxSubmit            MaxTRESPU
---------- ---------- ----------- ------- --------- --------------------
    hiprio     100000    03:00:00      50       100   cpu=128,gres/gpu=4
      prio       1000  3-00:00:00     500      1000   cpu=256,gres/gpu=8
  standard          0 14-00:00:00    2000     10000  cpu=768,gres/gpu=16
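
In case it helps, a QOS like 'hiprio' can be set up with sacctmgr
roughly as follows (just a sketch; the limits are the ones from the
table above and 'someuser' is a placeholder):

  # create the QOS and set its limits
  sacctmgr add qos hiprio
  sacctmgr modify qos hiprio set Priority=100000 MaxWall=03:00:00 \
      MaxJobs=50 MaxSubmitJobs=100 MaxTRESPerUser=cpu=128,gres/gpu=4

  # give a user access to the QOS via their association
  # ('someuser' is a placeholder name)
  sacctmgr modify user where name=someuser set qos+=hiprio

  # a site-wide default/maximum can also be set on the partition in
  # slurm.conf, e.g. DefaultTime=14-00:00:00 MaxTime=14-00:00:00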

We also have a page of documentation which explains how users can
benefit from backfill.  Thus users have a certain incentive to specify
a shorter wallclock limit, if they can.
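
A job then just requests the QOS and a suitable limit at submission,
e.g. (with a made-up script name)

  sbatch --qos=hiprio --time=02:30:00 job.sh

or equivalently in the batch script itself:

  #SBATCH --qos=hiprio
  #SBATCH --time=02:30:00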

'sqos' is just an alias for

  sacctmgr show qos format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20
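
In bash that is just something like

  alias sqos='sacctmgr show qos format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20'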

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
FUB-IT, Freie Universität Berlin

