Hoping someone may be able to help demystify some questions around the 
scheduler and preemption decisions.

I'm trying to better understand scheduler behavior as it pertains to preemption, so that scheduling becomes more predictable for us.
Slurm version is 24.11.5, OS is Ubuntu 22.04.

I have a fairly simple lo/hi partition setup, where the same set of nodes is assigned to both partitions:
> PartitionName=partition-lo  Nodes=foo[00..NN]  Default=YES  MaxTime=INFINITE  OverSubscribe=FORCE:1  State=UP
> PartitionName=partition-hi  Nodes=foo[00..NN]  Default=YES  MaxTime=INFINITE  OverSubscribe=NO       State=UP  PreemptMode=OFF
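
In case it matters, I've been sanity-checking what slurmctld actually loaded for these (as opposed to what's in slurm.conf) with plain scontrol, e.g.:

  scontrol show partition partition-lo
  scontrol show partition partition-hi
  # OverSubscribe= and PreemptMode= both appear in the output, so this shows
  # the effective per-partition preemption settings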

And then I have two QOSes per partition:
>             Name Priority  GraceTime                Preempt PreemptMode UsageFactor MaxJobsPU MaxTRESPA MaxJobsPA
> ---------------- -------- ---------- ---------------------- ----------- ----------- --------- --------- ---------
> qos-stateless-lo        1   00:00:00                            requeue    1.000000
>  qos-stateful-lo        1   00:00:00                            suspend    1.000000        NN   cpu=NNN       NNN
>  qos-stateful-hi        5   00:00:00 qos-state[ful,less]-lo     cluster    1.000000             cpu=NNN
> qos-stateless-hi        5   00:00:00 qos-state[ful,less]-lo     cluster    1.000000
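
(That output is from sacctmgr, roughly the command below, trimmed to the columns shown; the format field names simply mirror the headers above:)

  sacctmgr show qos format=name,priority,gracetime,preempt,preemptmode,usagefactor
  # PreemptMode=cluster on the two hi QOSes means those jobs fall back to the
  # cluster-wide PreemptMode from slurm.conf (SUSPEND,GANG, shown further down)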


The general way it works out is that stateful jobs will typically spawn stateless jobs.
Stateful jobs get suspended when preempted, while stateless jobs get requeued.
There are also some general guard rails to keep stateful jobs from clogging the
queue, blocking stateless jobs from scheduling, and creating a deadlock,
but that's not the specific issue here.
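
As an aside on how I watch that behavior when preemption does fire: the suspended and requeued jobs are visible with standard squeue state filters and --Format fields, e.g.:

  squeue -t SUSPENDED -O jobid,partition,qos,statecompact,reasonlist   # suspended stateful-lo jobs
  squeue -t PENDING   -O jobid,partition,qos,statecompact,reasonlist   # requeued stateless-lo jobs land back here

The relevant scheduler and preemption settings: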

> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_CPU_Memory
> SchedulerParameters=max_rpc_cnt=500,sched_min_interval=50000,sched_max_job_start=300,batch_sched_delay=6
> PriorityType=priority/multifactor
> PreemptType=preempt/qos
> PreemptMode=SUSPEND,GANG
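
For completeness, this is roughly how I check the live values and the scheduler's cycle statistics (both are stock Slurm tools, nothing custom):

  scontrol show config | grep -Ei 'preempt|sched'
  sdiag
  # sdiag's "Main schedule statistics" and "Backfilling stats" sections show
  # cycle times and how deep into the queue each pass actually got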

The workload is rather "high throughput", so a few of these settings were influenced by the
high-throughput guide <https://slurm.schedmd.com/high_throughput.html>.

What I end up seeing is that it can sometimes, but not always, take a while for
lo jobs to be preempted by hi jobs. An example is below, but in case the
mailing list eats images, here is a link to it: https://imgur.com/a/7xVFC8a
It's a rather low-resolution view, since it's just a scraper running on 5-minute
increments, but the blue filled area is "hi" jobs running and the yellow is
"lo" jobs running.
The second image (dashed lines) shows pending jobs for the same partitions.
Oddly, this specific instance did not show any preemption events in the
slurmctld logs, and users/admins were a bit perplexed as to why it dragged on
for so long without preemption kicking in.
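
Next time this happens, my plan is to temporarily raise logging to see whether the scheduler considered preemption at all. Something along these lines (standard scontrol debug level/flags, reverted afterwards):

  scontrol setdebug debug2            # bump the slurmctld log level temporarily
  scontrol setdebugflags +Backfill    # log backfill scheduling decisions
  scontrol setdebugflags +Gang        # log gang/suspend activity (we run SUSPEND,GANG)
  # ...reproduce the stuck lo/hi situation, then revert:
  scontrol setdebugflags -Backfill
  scontrol setdebugflags -Gang
  scontrol setdebug info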

I was considering looking deeper into the following SchedulerParameters to try to
better understand and predict preemption decisions (my reading of each is sketched just after the list):
default_queue_depth
partition_job_depth
sched_interval
sched_min_interval 
defer
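
For reference, my reading of slurm.conf(5) on these (corrections very welcome) is roughly:

  # default_queue_depth=N  - how many pending jobs the event-triggered (main)
  #                          scheduling pass will test; the default is 100
  # partition_job_depth=N  - per-partition cap on jobs tested in that pass;
  #                          0 (the default) means no limit
  # sched_interval=N       - seconds between full runs of the main scheduling
  #                          loop over the whole queue; the default is 60
  # sched_min_interval=N   - minimum microseconds between main scheduling
  #                          passes (we already set 50000, i.e. 50 ms)
  # defer                  - skip the per-job scheduling attempt at submit
  #                          time and rely on the periodic passes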

Hopefully someone can point me to some nuggets of information around this?
Appreciate any pointers,
Reed
