We have heterogeneous partitions too.  We see this occasionally, but it's not a
huge problem.  The way we have things set up is that all of the nodes are shared
by three partitions: short-gpu, medium-gpu and long-gpu.  The difference between
the partitions is the priority and the partition QoS.  Short-gpu has the highest
priority and allows the highest proportion of the GPUs to be used by a single
user, but has a short maximum time limit for jobs (2 hours).  Conversely,
long-gpu doesn't let a user take many GPUs, but jobs can run for a long time.
Medium-gpu, obviously, sits somewhere between the two.  This seems to work
reasonably well, and I can usually get a GPU for a short job almost immediately.
A rough sketch of that layout is below.
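
A rough sketch of that kind of layout; the node names, GPU counts and limit
values are invented for the example, not our actual configuration:

    # slurm.conf - three overlapping partitions on the same nodes, differing
    # only in priority, time limit and attached QoS (illustrative values)
    PartitionName=short-gpu  Nodes=gpu[01-10] MaxTime=02:00:00   PriorityTier=3 QOS=short-gpu
    PartitionName=medium-gpu Nodes=gpu[01-10] MaxTime=1-00:00:00 PriorityTier=2 QOS=medium-gpu
    PartitionName=long-gpu   Nodes=gpu[01-10] MaxTime=7-00:00:00 PriorityTier=1 QOS=long-gpu

    # sacctmgr - per-user GPU caps enforced by each partition QoS
    sacctmgr add qos short-gpu  MaxTRESPerUser=gres/gpu=16
    sacctmgr add qos medium-gpu MaxTRESPerUser=gres/gpu=8
    sacctmgr add qos long-gpu   MaxTRESPerUser=gres/gpu=2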

I would check your priority weights - if job age dominates the priority
calculation, you're likely to see young jobs failing to run even when they
would fit, which sounds very much like what you're describing.  We try to set
the weights so that fair-share dominates while jobs are young, and age only
starts to overtake fair-share once a job has been pending for a long time.  We
also set the QoS priority weight very high, so that really critical jobs go
straight to the top of the queue, but those QoSes are always tightly
constrained to a very small number of resources (we have a 'priority' QoS, but
it only allows a user to consume 16 CPUs and a single GPU).  The sketch below
gives a flavour of what I mean.
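
Roughly what that looks like, with invented numbers - only the ratios matter:

    # slurm.conf - multifactor priority with fair-share dominating age
    PriorityType=priority/multifactor
    PriorityWeightFairshare=100000
    PriorityWeightAge=10000
    PriorityWeightQOS=1000000
    PriorityMaxAge=7-0        # age factor saturates after a week of pending

    # sacctmgr - a high-priority but tightly capped 'priority' QoS
    sacctmgr add qos priority Priority=100 MaxTRESPerUser=cpu=16,gres/gpu=1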

I have to say, I find this to be an area where SLURM is a bit weaker than some 
other schedulers.  It’s very difficult, sometimes, to really understand why a 
particular job isn’t running.  I used to be an LSF administrator, and I really 
loved the 'bjobs -l -p' command, which tells you exactly why a job cannot run
on each node, and the answer can be different for every node.
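
For comparison, the closest I've found in Slurm only gives a single aggregate
reason and an estimated start time, not a per-node explanation (the job ID here
is just a placeholder):

    # LSF: per-node pending reasons for a job
    bjobs -l -p 12345

    # Slurm: aggregate pending reason and the scheduler's estimated start time
    squeue -j 12345 -o "%.10i %.9P %.20r %.20S"
    scontrol show job 12345 | grep -E 'Reason|StartTime'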

Tim

From: Paul Edmon via slurm-users <[email protected]>
Date: Thursday, 11 September 2025 at 20:36
To: [email protected] <[email protected]>
Subject: [slurm-users] Re: Scheduling issues with multiple different types of 
GPU in one partition


Yes, we've seen the same thing with mosaic/heterogeneous partitions. Our
solution is to split based on hardware type.

Having a bunch of partitions may seem unwieldy, but the scheduler can handle it.
For instance, we have 110 partitions and the scheduler handles them fine (most
of those are for hardware owned by specific groups, not public partitions
everyone can see). We've taken up the convention of naming our partitions after
the hardware type: for instance, we have a gpu partition (our A100s) and a
gpu_h200 partition, which makes it easy for people to identify the hardware.
People who can use both will leverage multi-partition submission, e.g.
#SBATCH -p gpu,gpu_h200.
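
As a concrete (made-up) example, a multi-partition job script looks something
like this; Slurm starts the job in whichever listed partition can run it first:

    #!/bin/bash
    #SBATCH --job-name=train
    #SBATCH -p gpu,gpu_h200      # consider both partitions
    #SBATCH --gres=gpu:1
    #SBATCH --time=04:00:00

    # train.py is a placeholder for the user's actual workload
    srun python train.py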

I don't know of a good solution if you want to keep the mosaic partition, as it
really requires your users to think at a higher level and realize there is
vacant hardware that could be used if they just selected a different GPU type.
Having separate partitions makes that much easier to see.
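
In a mixed partition it comes down to whether the job pins a GPU type or not;
these are ordinary gres requests, with the type names just as examples:

    # Pinned to H100s: queues behind other H100 jobs even if A100s sit idle
    #SBATCH --gres=gpu:h100:1

    # Any GPU type in the partition: can land on whatever is free
    #SBATCH --gres=gpu:1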

-Paul Edmon-

On 9/11/2025 3:23 PM, Kevin M. Hildebrand via slurm-users wrote:
We have several different types of GPUs in the same 'gpu' partition.  The
problem we're having occurs when one of those types of GPUs is fully occupied
and there are a bunch of queued jobs waiting for those GPUs.  If someone then
requests idle GPUs of a different type, their jobs end up getting stalled, even
though there are plenty of GPUs of the requested type available.

For example, say we have 10 A100 GPUs and 10 H100 GPUs.  If there are 10 H100 
GPU jobs running and more in queue waiting for them, subsequently submitted 
A100 jobs will sit in queue even if there are plenty of idle A100 GPUs.  The 
only way we can get the A100 jobs to run is by manually bumping their priority 
higher than the pending H100 jobs.
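
Manually bumping the priority means something along these lines (the job ID and
priority value here are placeholders):

    # Show pending jobs with their current priorities, highest first
    squeue -t PD -o "%.10i %.9P %.8Q %.20r" --sort=-Q

    # Push a specific pending A100 job above the waiting H100 jobs
    scontrol update jobid=12345 priority=1000000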

Has anyone else encountered this issue?  The only way we can think of to 
potentially solve it is to have separate partitions for each GPU type, but that 
seems unwieldy.

We are currently running Slurm 24.05.8.

Thanks,
Kevin

--
Kevin Hildebrand
Director of Research Technology and HPC Services
Division of IT

