Hoping someone may be able to help demystify some questions around the scheduler and preemption decisions.
I'm trying to better understand the scheduler's behavior as it pertains to preemption, so that scheduling becomes more predictable. Slurm version is 24.11.5, OS is Ubuntu 22.04.

I have a fairly simple lo/hi partition setup, where the same set of nodes is assigned to both partitions:

> PartitionName=partition-lo Nodes=foo[00..NN] Default=YES
>    MaxTime=INFINITE OverSubscribe=FORCE:1 State=UP
> PartitionName=partition-hi Nodes=foo[00..NN] Default=YES
>    MaxTime=INFINITE OverSubscribe=NO State=UP PreemptMode=OFF

And then I have two QOSes per partition:

>             Name   Priority  GraceTime                 Preempt PreemptMode UsageFactor MaxJobsPU     MaxTRESPA MaxJobsPA
> ---------------- ---------- ---------- ---------------------- ----------- ----------- --------- ------------- ---------
> qos-stateless-lo          1   00:00:00                            requeue    1.000000
>  qos-stateful-lo          1   00:00:00                            suspend    1.000000        NN       cpu=NNN       NNN
>  qos-stateful-hi          5   00:00:00 qos-state[ful,less]-lo     cluster    1.000000                 cpu=NNN
> qos-stateless-hi          5   00:00:00 qos-state[ful,less]-lo     cluster    1.000000

The general way it works out is that stateful jobs will typically spawn stateless jobs. Stateful jobs get suspended, while stateless jobs get requeued. There are also some general guard rails to keep stateful jobs from clogging the queue, preventing stateless jobs from scheduling, and creating a deadlock, but that's not the specific issue here.

> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_CPU_Memory
> SchedulerParameters=max_rpc_cnt=500,\
>                     sched_min_interval=50000,\
>                     sched_max_job_start=300,\
>                     batch_sched_delay=6
> PriorityType=priority/multifactor
> PreemptType=preempt/qos
> PreemptMode=SUSPEND,GANG

The workload is rather "high throughput", so a few settings were influenced by the high-throughput guide <https://slurm.schedmd.com/high_throughput.html>.

What I end up seeing is that it can sometimes, but not always, take a while for lo jobs to be preempted by hi jobs. An example is below, but in case the mailing list eats images, here is a link: https://imgur.com/a/7xVFC8a

This is a rather low-resolution view, since it is just a scraper running on 5-minute increments, but the blue filled area is "hi" jobs running and the yellow is "lo" jobs running. The second image (dashed lines) shows pending jobs for the same partitions.

Oddly, this specific instance did not show any preemption events in the slurmctld logs, and users/admins were a bit perplexed as to why this dragged on for so long without preemption kicking in.

I was considering looking deeper into the following to try to better understand and predict preemption decisions:

  default_queue_depth
  partition_job_depth
  sched_interval
  sched_min_interval
  defer

Hopefully someone can point me to some nuggets of information around this?

Appreciate any pointers,
Reed
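
P.S. If it helps anyone replying, I can pull diagnostics along these lines while the slow preemption is happening. This is only a rough sketch using standard scontrol/sacctmgr/sdiag invocations; the sacctmgr format field list is from memory, so it may need adjusting against the man page:

    # live preemption/scheduler settings as slurmctld actually sees them
    scontrol show config | grep -iE 'preempt|sched|select'

    # per-QOS preemption wiring (what produced the table above)
    sacctmgr show qos format=Name,Priority,GraceTime,Preempt,PreemptMode,UsageFactor

    # main and backfill scheduler cycle statistics, including queue depth processed per cycle
    sdiag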
