The former: the jobs should be able to run right now, but they aren't being considered. We currently have these backfill parameters set: bf_continue,bf_max_job_user=10, and bf_max_job_test is at its default of 500. However, sdiag reports that the bf_max_job_test limit has been hit zero times, so that's probably not relevant. I can try removing bf_max_job_user, but I don't think that's the issue either, as this problem also seems to affect users with only a few jobs in the queue when a different user has all of one GPU type consumed.
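
For reference, this is approximately what we have in slurm.conf, plus the sdiag check I mentioned (paraphrased from memory rather than pasted verbatim, so treat it as a sketch):

    # slurm.conf -- backfill-related scheduler settings (approximate)
    SchedulerType=sched/backfill
    SchedulerParameters=bf_continue,bf_max_job_user=10
    # bf_max_job_test is left at its default of 500

    # backfill section of sdiag output, where the bf_max_job_test counter appears
    sdiag | grep -A 25 'Backfilling stats'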
Kevin

On Thu, Sep 11, 2025 at 3:38 PM Ryan Novosielski <[email protected]> wrote:

> Are you saying these are jobs that should be able to run right now but
> they’re just not getting considered, or there’s something that’s wrong
> about the way they’re submitted that has to be manually corrected to allow
> them to run on A100s?
>
> If the former, it sounds like your backfill settings just might be
> inadequate to allow it to consider jobs far enough down the list.
>
> --
> #BlackLivesMatter
> ____
> || \\UTGERS,      |---------------------------*O*---------------------------
> ||_// the State   |       Ryan Novosielski (he/him) - [email protected]
> || \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> || \\   of NJ     | Office of Advanced Research Computing - MSB A555B, Newark
>      `'
>
> On Sep 11, 2025, at 15:23, Kevin M. Hildebrand via slurm-users <[email protected]> wrote:
>
> We have several different types of GPUs in the same 'gpu' partition. The
> problem we're having occurs when one of those types of GPUs is fully
> occupied and there are a bunch of queued jobs waiting for those GPUs. If
> someone requests idle GPUs of a different type, those jobs end up getting
> stalled, even though there are plenty of GPUs available.
>
> For example, say we have 10 A100 GPUs and 10 H100 GPUs. If there are 10
> H100 GPU jobs running and more in queue waiting for them, subsequently
> submitted A100 jobs will sit in queue even if there are plenty of idle A100
> GPUs. The only way we can get the A100 jobs to run is by manually bumping
> their priority higher than the pending H100 jobs.
>
> Has anyone else encountered this issue? The only way we can think of to
> potentially solve it is to have separate partitions for each GPU type, but
> that seems unwieldy.
>
> We are currently running Slurm 24.05.8.
>
> Thanks,
> Kevin
>
> --
> Kevin Hildebrand
> Director of Research Technology and HPC Services
> Division of IT
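
P.S. For concreteness, the requests in question look roughly like the ones below. The gres type names (a100, h100) and the script name are placeholders for illustration, not copied from actual submissions; both types live in the same 'gpu' partition and differ only in the gres type requested.

    # A100 request -- these sit pending even when A100s are idle,
    # as long as H100 jobs are queued ahead of them
    sbatch -p gpu --gres=gpu:a100:1 job.sh

    # H100 request -- all 10 H100s busy, more of these waiting in queue
    sbatch -p gpu --gres=gpu:h100:1 job.sh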
