[slurm-users] Scheduling oddity with multiple GPU types in same partition

2024-10-25 Thread Kevin M. Hildebrand via slurm-users
We have a 'gpu' partition with 30 or so nodes, some with A100s, some with H100s, and a few others. It appears that when (for example) all of the A100 GPUs are in use, if there are additional jobs requesting A100 GPUs pending, and those jobs have the highest priority in the partition, then jobs subm

[slurm-users] Poor scheduler performance with moderate number of jobs

2018-06-11 Thread Kevin M. Hildebrand
We're seeing some pretty bad performance with around 3000 jobs in the queue. We're using sched/backfill, and I've been tweaking the bf_ parameters to try to improve things, with limited results. But even before the backfill process starts, the main scheduling loop is taking so long per job that i

Re: [slurm-users] Account Usage Discrepancies

2017-11-28 Thread Kevin M. Hildebrand
Sounds suspiciously similar to a bug we reported a very long time ago, and that I'd submitted a patch for: https://bugs.schedmd.com/show_bug.cgi?id=1048 Which was then revisited here: https://bugs.schedmd.com/show_bug.cgi?id=2423 Though my fix handles a problem with a UsageFactor other than 1, I'

[slurm-users] SC17 - Tools for managing users and allocations

2017-11-15 Thread Kevin M. Hildebrand
At 10:10 AM on Thursday, Tom Payerle will be presenting a brief summary of tools he has developed at the University of Maryland for managing users and allocations. These tools work with the existing SLURM account and allocation management framework and provide a much richer way to view and man