Hi list,

We have GrpTRES limits on all accounts, which cause many higher-priority jobs to stay in the queue because of those limits. As a result, we rely heavily on the backfill scheduler. We also have a special lower-priority preemptable QOS with no limits.
We've noticed that when the cluster is loaded, submitting a job that is non-preemptable but not the highest priority causes the backfill algorithm to fail to start that job when it needs to kill preemptable jobs. The preemptable jobs are killed, but the job doesn't start.

From the logs, for job 3617065:

[2020-11-15T13:36:01.928] backfill test for JobId=3617065 Prio=680634 Partition=short
[2020-11-15T13:36:12.947] _preempt_jobs: preempted JobId=3616258 had to be killed
[2020-11-15T13:36:12.953] _preempt_jobs: preempted JobId=3616259 had to be killed
[2020-11-15T13:36:12.960] _preempt_jobs: preempted JobId=3616255 had to be killed
[2020-11-15T13:36:12.966] _preempt_jobs: preempted JobId=3616256 had to be killed
[2020-11-15T13:36:12.972] _preempt_jobs: preempted JobId=3616257 had to be killed
[2020-11-15T13:36:12.973] backfill: planned start of JobId=3617065 failed: Requested nodes are busy
[2020-11-15T13:36:12.973] JobId=3617065 to start at 2020-11-15T13:36:01, end at 2020-11-15T15:36:00 on nodes dumfries-002 in partition short

Looking at job 3616258, which was preempted on time:

$ sacct -j 3616258 -ojobid,end,state
       JobID                 End      State
------------ ------------------- ----------
3616258      2020-11-15T13:36:12  PREEMPTED
3616258.bat+ 2020-11-15T13:36:50  CANCELLED
3616258.ext+ 2020-11-15T13:36:13  COMPLETED

The job was preempted at 13:36:12, but its batch script only finished at 13:36:50. By then the backfill scheduler had already given up. The job will start in one of the subsequent backfill cycles, but in some cases this can take more than 30 minutes.

Is this intentional? That is, does the backfill scheduler preempt jobs on the first cycle and run the "real" job on the second (or a later) cycle? Has anyone else encountered this?
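To put a number on that gap, here is a minimal sketch that only does the arithmetic on the two sacct timestamps quoted above. The helper name cleanup_gap_seconds is made up for this illustration and is not part of Slurm or of our tooling.

```python
from datetime import datetime

# Timestamps copied from the sacct output for job 3616258.
FMT = "%Y-%m-%dT%H:%M:%S"
preempted_at = datetime.strptime("2020-11-15T13:36:12", FMT)    # job End (PREEMPTED)
batch_step_end = datetime.strptime("2020-11-15T13:36:50", FMT)  # batch step End (CANCELLED)


def cleanup_gap_seconds(job_end, step_end):
    """Hypothetical helper: seconds between the job being marked
    preempted and its last step actually ending."""
    return (step_end - job_end).total_seconds()


gap = cleanup_gap_seconds(preempted_at, batch_step_end)
print(f"batch step ended {gap:.0f} s after the job was marked PREEMPTED")
```

During that interval the node is presumably still reported as busy, which would match the "Requested nodes are busy" message logged at 13:36:12.973.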
Our Slurm version is 19.05.1, with KillWait=30 (we want to keep this above 0) and CompleteWait=0. The SchedulerParameters (which have been changed numerous times in the past weeks) currently include:

batch_sched_delay=5
bf_busy_nodes
bf_continue
bf_interval=90
bf_max_job_test=2500
bf_max_job_user_part=30
bf_max_time=270
bf_min_prio_reserve=1000000
bf_window=30300
bf_yield_interval=5000000
default_queue_depth=2000
defer
kill_invalid_depend
max_rpc_cnt=150
preempt_strict_order
sched_interval=120
sched_min_interval=1000000

Thanks in advance,
Yair.