Hi Davide,

Davide DelVento via slurm-users <slurm-users@lists.schedmd.com> writes:
> In the institution where I work, so far we have managed to live
> without mandatory wallclock limits (a policy decided well before I
> joined the organization), and that has been possible because the
> cluster was not very much utilized.
>
> Now that is changing, with more jobs being submitted and those being
> larger ones. As such I would like to introduce wallclock limits to
> allow slurm to be more efficient in scheduling jobs, including with
> backfill.
>
> My concern is that this user base is not used to it and therefore I
> want to make it easier for them, and avoid common complaints. I
> anticipate one of them would be "my job was cancelled even though
> there were enough nodes idle and no other job in line after mine"
> (since the cluster utilization is increasing, but not yet always full
> like it has been at most other places I know).
>
> So my question is: is it possible to implement "soft" wallclock limits
> in slurm, namely ones which would not be enforced unless necessary to
> run more jobs? In other words, is it possible to change the
> pre-emptability of a job only after some time has passed? I can think
> of some ways to hack this functionality myself with some cron or at
> jobs, and that might be easy enough to do, but I am not sure I can
> make it robust enough to cover all situations, so I'm looking for
> something either slurm-native or (if external solution) field-tested
> by someone else already, so that at least the worst kinks have been
> already ironed out.
>
> Thanks in advance for any suggestions you may provide!
We just have a default wallclock limit of 14 days, but we also have QOS
with shorter wallclock limits and higher priorities, albeit for fewer
jobs and resources:

  $ sqos
        Name   Priority     MaxWall MaxJobs MaxSubmit            MaxTRESPU
  ---------- ---------- ----------- ------- --------- --------------------
      hiprio     100000    03:00:00      50       100   cpu=128,gres/gpu=4
        prio       1000  3-00:00:00     500      1000   cpu=256,gres/gpu=8
    standard          0 14-00:00:00    2000     10000  cpu=768,gres/gpu=16

We also have a page of documentation which explains how users can profit
from backfill. Thus users have a certain incentive to specify a shorter
wallclock limit, if they can.

'sqos' is just an alias for

  sacctmgr show qos format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
FUB-IT, Freie Universität Berlin

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
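[Editor's note: for readers wanting to reproduce a tiered-QOS setup like the
one Loris describes, the table could be created with sacctmgr commands along
these lines. This is a sketch, not taken from the thread: the QOS names and
limits are simply the values from the table above, it assumes a working
accounting database (slurmdbd) and admin privileges, and exact field syntax
can vary between Slurm versions.]

```shell
# Create three QOS tiers: shorter MaxWall buys higher Priority,
# at the cost of tighter per-user job and TRES limits.
sacctmgr -i add qos hiprio set Priority=100000 MaxWall=03:00:00 \
    MaxJobs=50 MaxSubmitJobs=100 MaxTRESPerUser=cpu=128,gres/gpu=4
sacctmgr -i add qos prio set Priority=1000 MaxWall=3-00:00:00 \
    MaxJobs=500 MaxSubmitJobs=1000 MaxTRESPerUser=cpu=256,gres/gpu=8
sacctmgr -i add qos standard set Priority=0 MaxWall=14-00:00:00 \
    MaxJobs=2000 MaxSubmitJobs=10000 MaxTRESPerUser=cpu=768,gres/gpu=16

# A user who can promise a short runtime then submits, e.g.:
#   sbatch --qos=hiprio --time=02:00:00 job.sh
```

(The -i flag makes sacctmgr commit without an interactive confirmation
prompt; users must also be granted access to a QOS before they can
submit with it.)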