Hello list,

My cluster usually has a fairly heterogeneous job load and spends much of its time memory bound. Occasionally I have users who submit 100k+ short, low-resource jobs. Despite there being several thousand free cores and enough RAM to run the jobs, the backfill scheduler would never backfill them. It turned out there were a number of factors: the jobs were deep down in the queue, they came from an account with low priority, and there were a lot of them for the scheduler to consider. After a fair amount of tuning, the backfill scheduler parameters I finally settled on are:

SchedulerParameters=defer,bf_continue,bf_interval=20,bf_resolution=600,bf_yield_interval=1000000,sched_min_interval=2000000,bf_max_time=600,bf_max_job_test=1000000

After implementing these changes, the backfill scheduler began successfully scheduling these jobs on the cluster. While the cluster has a deep queue, the load on the slurmctld host can get fairly high. However, no users have reported issues with the responsiveness of the various Slurm commands, and the backup controller has never taken over either. The changes have been in place for a month or so with no ill effects that I have observed.

While I was troubleshooting I was definitely combing the list archives for other people's tuning suggestions, so I figured I would post a message here for posterity, as well as to see if anyone has similar settings or feedback :-).

Cheers,
Richard
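For anyone who finds this in the archives later, here is my understanding of what each parameter is doing, based on my reading of the slurm.conf man page (worth double-checking against the man page for your own Slurm version, since defaults have moved around between releases). The multi-line layout and inline comments below are just for readability in this email; in slurm.conf it all has to stay on one line with no comments:

    SchedulerParameters=
      defer                       # skip the attempt to schedule each job at submit time;
                                  #   batch the work into regular scheduler runs instead
      bf_continue                 # let backfill resume where it left off after yielding
                                  #   locks, rather than restarting from the queue top
      bf_interval=20              # start a backfill cycle every 20 s (default 30)
      bf_resolution=600           # plan backfill reservations at 600 s resolution
                                  #   (default 60); coarser = less work per job tested
      bf_yield_interval=1000000   # microseconds of backfill work between lock yields;
                                  #   1 s keeps RPCs (squeue etc.) responsive
      sched_min_interval=2000000  # microseconds; run the main scheduling loop at most
                                  #   every 2 s to cut slurmctld load
      bf_max_time=600             # allow a single backfill cycle to run up to 600 s
      bf_max_job_test=1000000     # consider up to 1,000,000 jobs per cycle (the default
                                  #   is far smaller), so jobs deep in the queue are reached

The short version of the trade-off as I understand it: bf_max_job_test and bf_max_time let backfill reach the huge batch of jobs deep in the queue, bf_resolution keeps the per-job cost of doing so manageable, and bf_yield_interval/sched_min_interval keep the controller responsive while it grinds through them.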