The former: these are jobs that should be able to run right now, but they
aren't being considered.
We currently have these backfill parameters set:
bf_continue, bf_max_job_user=10.
bf_max_job_test is at the default of 500.  However, sdiag reports that the
number of times bf_max_job_test has been hit is zero, so that's probably not
relevant.
I can try removing bf_max_job_user, but I don't think that's the issue
either, as this problem also seems to affect users with only a few jobs in
the queue when a different user has consumed all of one GPU type.

Kevin



On Thu, Sep 11, 2025 at 3:38 PM Ryan Novosielski <[email protected]>
wrote:

> Are you saying these are jobs that should be able to run right now but
> just aren't getting considered, or is there something wrong with the way
> they're submitted that has to be manually corrected to allow them to run
> on A100s?
>
> If the former, it sounds like your backfill settings might just be too
> restrictive to let the scheduler consider jobs far enough down the queue.
>
> --
> #BlackLivesMatter
> ____
> || \\UTGERS,     |---------------------------*O*---------------------------
> ||_// the State  |     Ryan Novosielski (he/him) - [email protected]
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
>      `'
>
> On Sep 11, 2025, at 15:23, Kevin M. Hildebrand via slurm-users <
> [email protected]> wrote:
>
> We have several different types of GPUs in the same 'gpu' partition.  The
> problem occurs when one of those GPU types is fully occupied and there are
> a number of queued jobs waiting for it.  If someone then requests idle GPUs
> of a different type, those newly submitted jobs end up stalled in the
> queue, even though plenty of GPUs of the type they requested are idle.
>
> For example, say we have 10 A100 GPUs and 10 H100 GPUs.  If there are 10
> H100 GPU jobs running and more in queue waiting for them, subsequently
> submitted A100 jobs will sit in queue even if there are plenty of idle A100
> GPUs.  The only way we can get the A100 jobs to run is by manually bumping
> their priority higher than the pending H100 jobs.
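>
> To be concrete, the manual bump is something along these lines (the job
> ID and priority value here are just placeholders):
>
>     # job ID and priority value are placeholders
>     scontrol update JobId=123456 Priority=100000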
>
> Has anyone else encountered this issue?  The only way we can think of to
> potentially solve it is to have separate partitions for each GPU type, but
> that seems unwieldy.
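>
> If we went that route, slurm.conf would presumably end up with one
> partition per GPU type, something like the sketch below (node names and
> ranges are hypothetical):
>
>     # hypothetical node ranges, one partition per GPU type
>     PartitionName=gpu-a100 Nodes=gpu[001-005] Default=NO MaxTime=INFINITE State=UP
>     PartitionName=gpu-h100 Nodes=gpu[006-010] Default=NO MaxTime=INFINITE State=UP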
>
> We are currently running Slurm 24.05.8.
>
> Thanks,
> Kevin
>
> --
> Kevin Hildebrand
> Director of Research Technology and HPC Services
> Division of IT
>
>
-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
