Hello,

I currently manage a small cluster split into 4 partitions. I am seeing 
unexpected scheduler behavior after a single user flooded one partition's 
queue with a large number of jobs (around 60,000). Each user is bound to a 
global GrpTRES CPU limit. Once this user reaches their CPU limit, their jobs 
are queued with the reason "AssocGroupCpuLimit", but after a few hundred or 
so jobs the reason switches to "Priority".
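
For context, the limit is set on each user's association roughly along these 
lines (the user name and CPU count here are illustrative, not our real values):

# illustrative only; actual user names and CPU limit differ
sacctmgr modify user where name=someuser set GrpTRES=cpu=512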

The issue is that once this switch occurs it appears to affect all other 
partitions as well. Currently, any job submitted to any partition, regardless 
of the resources available, is queued by the scheduler with the reason 
"Priority".

We had the scheduler initially configured for backfill, but we have also tried 
switching to builtin and it made no difference. I also tried increasing 
default_queue_depth to 100000, which didn't help. The scheduler log is also 
unhelpful: it simply lists the jobs held by the accounting policy and never 
mentions the jobs queued with "Priority":

sched: [2021-06-11T13:21:53.993] JobId=495780 delayed for accounting policy
sched: [2021-06-11T13:21:53.997] JobId=495781 delayed for accounting policy
sched: [2021-06-11T13:21:54.001] JobId=495782 delayed for accounting policy
sched: [2021-06-11T13:21:54.005] JobId=495783 delayed for accounting policy
sched: [2021-06-11T13:21:54.005] loop taking too long, breaking out
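
For reference, the scheduler-related lines in slurm.conf currently look 
roughly like this (other parameters omitted):

SchedulerType=sched/backfill
# also tried SchedulerType=sched/builtin with the same result
SchedulerParameters=default_queue_depth=100000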

I've gone through all the scheduler documentation I could find and can't seem 
to resolve this. I'm hoping I'm simply missing something.

Any help would be great. Thank you!

Jason
