Hi Davide,

Davide DelVento <davide.quan...@gmail.com> writes:
> Thanks Loris,
>
> Am I correct if, reading between the lines, you're saying: rather than
> going on with my "soft" limit idea, just use the regular hard limits,
> being generous with the default and providing user education instead?
> In fact that is an alternative approach that I am considering too.

Yes.  We in fact never get anyone complaining about jobs being cancelled
due to having reached their time-limit, even though other resources were
idle.  Having said that, we do occasionally extend the time-limit for
individual jobs, when requested.  We also don't pre-empt any jobs.

Apart from that, I imagine implementing your 'soft' limits robustly might
be quite challenging and/or time-consuming, as I am not aware that Slurm
has anything like that built in.

Cheers,

Loris

> On Wed, Jun 11, 2025 at 6:15 AM Loris Bennett via slurm-users
> <slurm-users@lists.schedmd.com> wrote:
>
> > Hi Davide,
> >
> > Davide DelVento via slurm-users <slurm-users@lists.schedmd.com> writes:
> >
> > > In the institution where I work, we have so far managed to live
> > > without mandatory wallclock limits (a policy decided well before I
> > > joined the organization), which was possible because the cluster
> > > was not heavily utilized.
> > >
> > > Now that is changing, with more and larger jobs being submitted.
> > > I would therefore like to introduce wallclock limits so that Slurm
> > > can schedule jobs more efficiently, including with backfill.
> > >
> > > My concern is that this user base is not used to such limits, so I
> > > want to make the transition easy for them and avoid common
> > > complaints.  I anticipate one of them would be "my job was
> > > cancelled even though there were enough idle nodes and no other
> > > job in line after mine" (since cluster utilization is increasing,
> > > but is not yet always full, as it is at most other places I know).
> > >
> > > So my question is: is it possible to implement "soft" wallclock
> > > limits in Slurm, namely ones which would not be enforced unless
> > > necessary to run more jobs?  In other words, is it possible to
> > > change the pre-emptability of a job only after some time has
> > > passed?  I can think of some ways to hack this functionality
> > > myself with some cron or at jobs, and that might be easy enough
> > > to do, but I am not sure I can make it robust enough to cover all
> > > situations.  So I'm looking for something either Slurm-native or,
> > > if an external solution, field-tested by someone else already, so
> > > that at least the worst kinks have already been ironed out.
> > >
> > > Thanks in advance for any suggestions you may provide!
> >
> > We just have a default wallclock limit of 14 days, but we also have
> > QOS with shorter wallclock limits and higher priorities, albeit for
> > fewer jobs and resources:
> >
> >   $ sqos
> >         Name   Priority     MaxWall MaxJobs MaxSubmit            MaxTRESPU
> >   ---------- ---------- ----------- ------- --------- --------------------
> >       hiprio     100000    03:00:00      50       100   cpu=128,gres/gpu=4
> >         prio       1000  3-00:00:00     500      1000   cpu=256,gres/gpu=8
> >     standard          0 14-00:00:00    2000     10000  cpu=768,gres/gpu=16
> >
> > We also have a page of documentation which explains how users can
> > benefit from backfill.  Users thus have a certain incentive to
> > specify a shorter wallclock limit, if they can.
> >
> > 'sqos' is just an alias for
> >
> >   sacctmgr show qos format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20
> >
> > Cheers,
> >
> > Loris
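
P.S.  A few concrete snippets, in case they are useful as starting
points.  The 14-day default above is just set on the partition; a
minimal, illustrative slurm.conf line (the partition name and node list
are made up) would be:

    PartitionName=main Nodes=node[001-100] Default=YES DefaultTime=14-00:00:00 MaxTime=14-00:00:00

DefaultTime is what a job gets if the user does not specify --time;
MaxTime is the hard cap a user can request up to.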
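
The QOS themselves are managed with sacctmgr.  Creating something like
our 'hiprio' would look roughly like this (untested as written, and the
TRES names depend on your gres configuration):

    # create the QOS, then set its limits
    sacctmgr add qos hiprio
    sacctmgr modify qos hiprio set Priority=100000 MaxWall=03:00:00 \
        MaxJobsPerUser=50 MaxSubmitJobsPerUser=100 \
        MaxTRESPerUser=cpu=128,gres/gpu=4

    # grant a user access to it (example user name)
    sacctmgr modify user davide set QOS+=hiprio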
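
And if you do still want to experiment with the cron idea, the core of
it might look like the sketch below.  This is completely untested: the
QOS name 'preemptable' and the six-hour threshold are invented, it
assumes PreemptType=preempt/qos with a QOS that the others are allowed
to preempt, and I am not even sure slurmctld will let you change the QOS
of a running job, which is exactly the kind of thing you would have to
test.

    #!/bin/bash
    # Sketch: demote jobs running longer than THRESHOLD seconds to a
    # preemptable QOS, so they only get killed if the scheduler
    # actually needs the resources for other work.
    THRESHOLD=$(( 6 * 3600 ))   # made-up six-hour "soft" limit
    NOW=$(date +%s)

    # %i = job id, %S = start time, %q = QOS
    squeue --noheader --states=RUNNING --format="%i %S %q" |
    while read -r jobid start qos; do
        # skip jobs that have already been demoted
        [ "$qos" = "preemptable" ] && continue
        started=$(date -d "$start" +%s 2>/dev/null) || continue
        if [ $(( NOW - started )) -gt "$THRESHOLD" ]; then
            scontrol update JobId="$jobid" QOS=preemptable
        fi
    done

Run from root's crontab every few minutes.  The corner cases (array
jobs, suspended jobs, jobs whose users picked a QOS themselves) are
where the robustness problems you anticipate would show up.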

--
Dr. Loris Bennett (Herr/Mr)
FUB-IT, Freie Universität Berlin

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com