Hi Davide,

Davide DelVento <davide.quan...@gmail.com> writes:
> Thanks Loris,
>
> Am I correct if, reading between the lines, you're saying: rather than
> going on with my "soft" limit idea, just use the regular hard limits,
> being generous with the default and providing user education instead?
> In fact that is an alternative approach that I am considering too.

Yes.  We in fact never get anyone complaining about jobs being cancelled
due to having reached their time-limit, even though other resources were
idle.  Having said that, we do occasionally extend the time-limit for
individual jobs, when requested.  We also don't pre-empt any jobs.

Apart from that, I imagine implementing your 'soft' limits robustly might
be quite challenging and/or time-consuming, as I am not aware that Slurm
has anything like that built in.

Cheers,

Loris

> On Wed, Jun 11, 2025 at 6:15 AM Loris Bennett via slurm-users
> <slurm-users@lists.schedmd.com> wrote:
>
> > Hi Davide,
> >
> > Davide DelVento via slurm-users <slurm-users@lists.schedmd.com> writes:
> >
> > > In the institution where I work, we have so far managed to live
> > > without mandatory wallclock limits (a policy decided well before I
> > > joined the organization), which was possible because the cluster
> > > was not heavily utilized.
> > >
> > > Now that is changing, with more and larger jobs being submitted.
> > > I would therefore like to introduce wallclock limits so that Slurm
> > > can schedule jobs more efficiently, including with backfill.
> > >
> > > My concern is that this user base is not used to such limits, so I
> > > want to make the transition easy for them and avoid common
> > > complaints.  I anticipate one of them would be "my job was
> > > cancelled even though there were enough idle nodes and no other
> > > job in line after mine" (since cluster utilization is increasing,
> > > but is not yet always full, as it is at most other places I know).
> > >
> > > So my question is: is it possible to implement "soft" wallclock
> > > limits in Slurm, namely ones which would not be enforced unless
> > > necessary to run more jobs?  In other words, is it possible to
> > > change the pre-emptability of a job only after some time has
> > > passed?  I can think of some ways to hack this functionality
> > > myself with some cron or at jobs, and that might be easy enough
> > > to do, but I am not sure I can make it robust enough to cover all
> > > situations.  So I'm looking for something either Slurm-native or,
> > > if an external solution, field-tested by someone else already, so
> > > that at least the worst kinks have already been ironed out.
> > >
> > > Thanks in advance for any suggestions you may provide!
> >
> > We just have a default wallclock limit of 14 days, but we also have
> > QOS with shorter wallclock limits and higher priorities, albeit for
> > fewer jobs and resources:
> >
> >   $ sqos
> >         Name   Priority     MaxWall MaxJobs MaxSubmit            MaxTRESPU
> >   ---------- ---------- ----------- ------- --------- --------------------
> >       hiprio     100000    03:00:00      50       100   cpu=128,gres/gpu=4
> >         prio       1000  3-00:00:00     500      1000   cpu=256,gres/gpu=8
> >     standard          0 14-00:00:00    2000     10000  cpu=768,gres/gpu=16
> >
> > We also have a page of documentation which explains how users can
> > benefit from backfill.  Users thus have a certain incentive to
> > specify a shorter wallclock limit, if they can.
> >
> > 'sqos' is just an alias for
> >
> >   sacctmgr show qos format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20
> >
> > Cheers,
> >
> > Loris
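
P.S.  A few concrete snippets, in case they are useful as starting
points.  The 14-day default above is just set on the partition; a
minimal, illustrative slurm.conf line (the partition name and node list
are made up) would be:

    PartitionName=main Nodes=node[001-100] Default=YES DefaultTime=14-00:00:00 MaxTime=14-00:00:00

DefaultTime is what a job gets if the user does not specify --time;
MaxTime is the hard cap a user can request up to.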
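
The QOS themselves are managed with sacctmgr.  Creating something like
our 'hiprio' would look roughly like this (untested as written, and the
TRES names depend on your gres configuration):

    # create the QOS, then set its limits
    sacctmgr add qos hiprio
    sacctmgr modify qos hiprio set Priority=100000 MaxWall=03:00:00 \
        MaxJobsPerUser=50 MaxSubmitJobsPerUser=100 \
        MaxTRESPerUser=cpu=128,gres/gpu=4

    # grant a user access to it (example user name)
    sacctmgr modify user davide set QOS+=hiprio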
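
And if you do still want to experiment with the cron idea, the core of
it might look like the sketch below.  This is completely untested: the
QOS name 'preemptable' and the six-hour threshold are invented, it
assumes PreemptType=preempt/qos with a QOS that the others are allowed
to preempt, and I am not even sure slurmctld will let you change the QOS
of a running job, which is exactly the kind of thing you would have to
test.

    #!/bin/bash
    # Sketch: demote jobs running longer than THRESHOLD seconds to a
    # preemptable QOS, so they only get killed if the scheduler
    # actually needs the resources for other work.
    THRESHOLD=$(( 6 * 3600 ))   # made-up six-hour "soft" limit
    NOW=$(date +%s)

    # %i = job id, %S = start time, %q = QOS
    squeue --noheader --states=RUNNING --format="%i %S %q" |
    while read -r jobid start qos; do
        # skip jobs that have already been demoted
        [ "$qos" = "preemptable" ] && continue
        started=$(date -d "$start" +%s 2>/dev/null) || continue
        if [ $(( NOW - started )) -gt "$THRESHOLD" ]; then
            scontrol update JobId="$jobid" QOS=preemptable
        fi
    done

Run from root's crontab every few minutes.  The corner cases (array
jobs, suspended jobs, jobs whose users picked a QOS themselves) are
where the robustness problems you anticipate would show up.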

--
Dr. Loris Bennett (Herr/Mr)
FUB-IT, Freie Universität Berlin

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com