We have a 'gpu' partition with 30 or so nodes, some with A100s, some with
H100s, and a few others.
It appears that when (for example) all of the A100 GPUs are in use and
additional jobs requesting A100 GPUs are pending, and those jobs have
the highest priority in the partition, then jobs subm
We're seeing some pretty bad performance with around 3000 jobs in the queue.
We're using sched/backfill, and I've been tweaking the bf_ parameters
to try to improve things, with limited results.
But even before the backfill process starts, the main scheduling loop
is taking so long per job that i
This sounds suspiciously similar to a bug we reported a very long time ago,
and for which I submitted a patch:
https://bugs.schedmd.com/show_bug.cgi?id=1048
Which was then revisited here:
https://bugs.schedmd.com/show_bug.cgi?id=2423
Though my fix handles a problem with a UsageFactor other than 1, I'
At 10:10 AM on Thursday, Tom Payerle will be presenting a brief summary
of tools he has developed at the University of Maryland for managing users
and allocations. These tools work with the existing SLURM account and
allocation management framework and provide a much richer way to view and
man