Hello,

We are seeing an issue where smaller jobs push back larger jobs. We are using 
Slurm 19.05.8, so I am not sure whether this has been fixed in newer releases. 
On a 4-node test partition, I submit 3 jobs as 2 users:



ssh hpcdev1@navy51 'sbatch --nodes=3 --ntasks-per-node=40 --partition=backfilltest --time=120 --wrap="sleep 7200"'

ssh hpcdev2@navy51 'sbatch --nodes=4 --ntasks-per-node=40 --partition=backfilltest --time=60 --wrap="sleep 3600"'

ssh hpcdev2@navy51 'sbatch --nodes=4 --ntasks-per-node=40 --partition=backfilltest --time=60 --wrap="sleep 3600"'



Then I increase the priority of the pending jobs significantly. From reading 
the manual, my understanding is that nodes should be held for these jobs.

for job in $(squeue -h -p backfilltest -t pd -o %i); do scontrol update job ${job} priority=1000000000; done



squeue -p backfilltest -o "%i | %u | %C | %Q | %l | %S | %T"

JOBID | USER | CPUS | PRIORITY | TIME_LIMIT | START_TIME | STATE
28482 | hpcdev2 | 160 | 1000000000 | 1:00:00 | N/A | PENDING
28483 | hpcdev2 | 160 | 1000000000 | 1:00:00 | N/A | PENDING
28481 | hpcdev1 | 120 | 50083 | 2:00:00 | 2020-12-08T09:44:15 | RUNNING



So there is one node free in our 4-node partition. Naturally, a small job with 
a walltime of less than 1 hour could run on that node, but we are also seeing 
backfill start longer jobs there.



backfilltest    up 2-12:00:00      3  alloc reddev[001-003]
backfilltest    up 2-12:00:00      1   idle reddev004





ssh hpcdev3@navy51 'sbatch --nodes=1 --ntasks-per-node=40 --partition=backfilltest --time=720 --wrap="sleep 432000"'





squeue -p backfilltest -o "%i | %u | %C | %Q | %l | %S | %T"

JOBID | USER | CPUS | PRIORITY | TIME_LIMIT | START_TIME | STATE
28482 | hpcdev2 | 160 | 1000000000 | 1:00:00 | N/A | PENDING
28483 | hpcdev2 | 160 | 1000000000 | 1:00:00 | N/A | PENDING
28481 | hpcdev1 | 120 | 50083 | 2:00:00 | 2020-12-08T09:44:15 | RUNNING
28484 | hpcdev3 | 40 | 37541 | 12:00:00 | 2020-12-08T09:54:48 | RUNNING
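
To spell out why I did not expect job 28484 to start, here is a minimal sketch 
of the check I assumed backfill would make. It is only my own illustration of 
the expectation, not Slurm's actual algorithm. The function name would_delay is 
hypothetical, and the 10-minute offset is rounded from the two start times 
above (09:44:15 and 09:54:48).

```shell
#!/usr/bin/env bash
# Sketch of the backfill expectation, NOT Slurm's real scheduling logic.
# All times are in minutes, relative to the start of job 28481.

running_end=120      # job 28481: --time=120, holds 3 of the 4 nodes
candidate_start=10   # job 28484 started roughly 10 minutes later (assumption: rounded)

# would_delay LIMIT: succeeds (exit 0) if a 1-node job with this time limit,
# started at candidate_start, would still hold the idle node when job 28481
# reaches its limit, and so push back a pending 4-node job.
would_delay() {
    local limit=$1
    (( candidate_start + limit > running_end ))
}

for limit in 60 720; do
    if would_delay "$limit"; then
        echo "limit=${limit}: delays the 4-node jobs"
    else
        echo "limit=${limit}: fits in the gap"
    fi
done
```

Under these assumptions a 60-minute job fits in the gap, while the 720-minute 
limit of job 28484 runs far past the point where all 4 nodes would otherwise 
be free.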



Is this expected behaviour? It is also odd that the pending jobs don't have a 
start time. I have increased the backfill parameters significantly, but that 
doesn't seem to affect this at all.



SchedulerParameters=bf_window=14400,bf_resolution=2400,bf_max_job_user=80,bf_continue,default_queue_depth=1000,bf_interval=60
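
For readability, the snippet below converts the time-based settings into 
larger units. It assumes, from my reading of the slurm.conf man page, that 
bf_window is given in minutes and that bf_resolution and bf_interval are given 
in seconds. The shell variable names are only for illustration.

```shell
#!/usr/bin/env bash
# Unit assumptions (my reading of the slurm.conf man page):
#   bf_window in minutes; bf_resolution and bf_interval in seconds.
bf_window=14400
bf_resolution=2400
bf_interval=60

bf_window_days=$(( bf_window / 1440 ))            # minutes -> days
bf_resolution_minutes=$(( bf_resolution / 60 ))   # seconds -> minutes

echo "bf_window     = ${bf_window_days} days"
echo "bf_resolution = ${bf_resolution_minutes} minutes"
echo "bf_interval   = ${bf_interval} seconds"
```

If those units are right, the backfill window is far longer than the 2-hour 
limit of the running job, so I would not expect the window size to be what 
stops the pending jobs from getting a start time.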


Best regards,

David
