Hello list,

My cluster usually has a pretty heterogeneous job load and spends a lot of the
time memory bound.  Occasionally I have users who submit 100k+ short,
low-resource jobs.  Despite having several thousand free cores and enough RAM
to run the jobs, the backfill scheduler would never backfill them.  It turns
out that there were a number of factors: they were deep down in the queue, they
came from an account with low priority, and there were a lot of them for the
scheduler to consider.  After a bunch of tuning, the backfill scheduler
parameters I finally settled on are:

SchedulerParameters=defer,bf_continue,bf_interval=20,bf_resolution=600,bf_yield_interval=1000000,sched_min_interval=2000000,bf_max_time=600,bf_max_job_test=1000000
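
For anyone reusing that line, the mix of units is easy to trip over.  The
Python sketch below just splits the value into its options and labels each one
with the unit I understand from the slurm.conf man page (seconds for
bf_interval, bf_resolution and bf_max_time; microseconds for bf_yield_interval
and sched_min_interval; a job count for bf_max_job_test).  The function name
and the UNITS table are my own illustration, not part of Slurm, so check the
units against the man page for your version before relying on them.

```python
# Illustrative helper, not part of Slurm: split a SchedulerParameters value
# into its options and label each with the unit I believe slurm.conf uses.
SCHEDULER_PARAMETERS = (
    "defer,bf_continue,bf_interval=20,bf_resolution=600,"
    "bf_yield_interval=1000000,sched_min_interval=2000000,"
    "bf_max_time=600,bf_max_job_test=1000000"
)

# Assumption: units as I read them in the slurm.conf man page.
UNITS = {
    "bf_interval": "seconds",
    "bf_resolution": "seconds",
    "bf_max_time": "seconds",
    "bf_yield_interval": "microseconds",
    "sched_min_interval": "microseconds",
    "bf_max_job_test": "jobs",
}


def parse_scheduler_parameters(value):
    """Return {option: int value}, with True for bare flags such as 'defer'."""
    options = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, raw = item.partition("=")
        options[name] = int(raw) if sep else True
    return options


if __name__ == "__main__":
    for name, val in parse_scheduler_parameters(SCHEDULER_PARAMETERS).items():
        if val is True:
            print(f"{name}: enabled")
        else:
            print(f"{name}: {val} {UNITS.get(name, '(unit unknown)')}")
```

Read this way, bf_yield_interval=1000000 and sched_min_interval=2000000 are
1 and 2 seconds respectively, while bf_max_time=600 is 10 minutes.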

After implementing these changes the backfill scheduler began to successfully
schedule these jobs on the cluster.  When the cluster has a deep queue, the
load on the slurmctld host can get pretty high.  However, no users have
reported issues with the responsiveness of the various Slurm commands, and the
backup controller has never taken over either.  The changes have been in place
for a month or so with no ill effects that I have observed.

While I was troubleshooting I was definitely combing the list archives for
other people's tuning suggestions, so I figured I would post a message here for
posterity as well as see if anyone has similar settings or feedback :-).

Cheers,
Richard
