Are you saying these are jobs that should be able to run right now but just 
aren't being considered, or is there something wrong with the way they're 
submitted that has to be manually corrected before they can run on the A100s?

If the former, it sounds like your backfill settings might simply not let the 
backfill scheduler consider jobs far enough down the queue.
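
For what it's worth, the knobs I'd look at are the bf_* options under 
SchedulerParameters in slurm.conf. A rough sketch of the sort of thing I mean 
(the values here are illustrative only, not recommendations for your site):

    # slurm.conf -- backfill scheduler tuning (example values)
    SchedulerType=sched/backfill
    SchedulerParameters=bf_max_job_test=1000,bf_window=4320,bf_interval=60,bf_continue,bf_max_job_user=50

bf_max_job_test in particular caps how many pending jobs the backfill 
scheduler will even examine per cycle, so if the A100 jobs are queued behind a 
long run of pending H100 jobs, they may never get looked at.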

--
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |     Ryan Novosielski (he/him) - [email protected]
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'

On Sep 11, 2025, at 15:23, Kevin M. Hildebrand via slurm-users 
<[email protected]> wrote:

We have several different types of GPUs in the same 'gpu' partition.  The 
problem we're having occurs when one of those GPU types is fully occupied and 
there are a bunch of queued jobs waiting for it.  If someone then requests 
idle GPUs of a different type, those new jobs end up stalled, even though 
there are plenty of GPUs of the requested type available.

For example, say we have 10 A100 GPUs and 10 H100 GPUs.  If there are 10 H100 
jobs running and more in the queue waiting for them, subsequently submitted 
A100 jobs will sit in the queue even if there are plenty of idle A100 GPUs.  
The only way we can get the A100 jobs to run is by manually bumping their 
priority higher than that of the pending H100 jobs.
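
For reference, the bump is just a manual priority override along these lines 
(job ID and value made up):

    scontrol update JobId=12345 Priority=1000000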

Has anyone else encountered this issue?  The only way we can think of to 
potentially solve it is to have separate partitions for each GPU type, but that 
seems unwieldy.
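
If we did go that route, a minimal sketch of what the slurm.conf side might 
look like (node names and counts below are placeholders, not our actual 
layout):

    # slurm.conf -- one partition per GPU type (placeholder names/counts)
    NodeName=gpu-a100-[01-05] Gres=gpu:a100:2
    NodeName=gpu-h100-[01-05] Gres=gpu:h100:2
    PartitionName=a100 Nodes=gpu-a100-[01-05] State=UP
    PartitionName=h100 Nodes=gpu-h100-[01-05] State=UP

Jobs would then have to target the right partition rather than just a GRES 
type, which keeps the per-type queues independent but means every new GPU 
type adds another partition to maintain.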

We are currently running Slurm 24.05.8.

Thanks,
Kevin

--
Kevin Hildebrand
Director of Research Technology and HPC Services
Division of IT


-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
