Re: [slurm-users] What happens if GPU GRES exceeding number of GPUs per node

Juergen Salk Thu, 18 Jan 2024 00:23:44 -0800

Hi Wirawan,

in general '--gres=gpu:6' actually means six units of a generic resource
named 'gpu' per node. Each unit may or may not be associated with a physical
GPU device.


I'd check the node configuration for the number of gres=gpu resource units
that are configured for that node.

  scontrol show node <node>
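
If it helps, here is a minimal Python sketch of how the Gres= field in that output could be turned into a unit count. The function name gres_gpu_count and the sample text are made up for illustration, and the parsing assumes the usual name[:type]:count layout with an optional parenthesised suffix such as (S:0-1). Treat it as a rough helper, not as a description of how Slurm itself counts.

```python
import re


def gres_gpu_count(scontrol_output: str) -> int:
    """Sum the 'gpu' units listed in the Gres= field of `scontrol show node` text.

    Assumes entries look like name[:type]:count, optionally followed by a
    parenthesised suffix; an entry without a numeric count is treated as 1.
    """
    match = re.search(r"\bGres=(\S+)", scontrol_output)
    if match is None or match.group(1) == "(null)":
        return 0
    # Drop parenthesised suffixes such as "(S:0-1)" before splitting.
    value = re.sub(r"\([^)]*\)", "", match.group(1))
    total = 0
    for entry in value.split(","):
        parts = entry.split(":")
        if parts[0] != "gpu":
            continue  # other generic resources are not counted here
        last = parts[-1]
        total += int(last) if last.isdigit() else 1
    return total


# Hypothetical excerpt, not taken from a real node.
sample = "NodeName=node01 CPUTot=32\n   Gres=gpu:4(S:0-1)\n   CfgTRES=cpu=32,gres/gpu=4\n"
print(gres_gpu_count(sample))
```

Comparing that number with the count of physical devices in the node shows whether one gres=gpu unit corresponds to one GPU on your system.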

Maybe your GPU devices are Multi-Instance GPUs (MIG), each one split into
multiple separate GPU instances, so that every gres=gpu unit counts against
the total number of MIG instances rather than the number of physical GPU
devices on the node?
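
As a cross-check, the AllocTRES string from the accounting record quoted below can be broken down per node. This is only a sketch: parse_tres is a hypothetical helper, it handles integer values only, and it assumes the TRES counts are job-wide totals, which cpu=30 for 3 nodes with 10 cores each seems to suggest.

```python
def parse_tres(tres: str) -> dict[str, int]:
    """Split a TRES string such as 'cpu=30,node=3' into integer counts.

    Only integer-valued entries are handled; values with unit suffixes
    (for example mem=4G) are outside the scope of this sketch.
    """
    return {key: int(value)
            for key, value in (item.split("=", 1) for item in tres.split(","))}


# AllocTRES exactly as shown in the quoted job record.
alloc = parse_tres("billing=30,cpu=30,gres/gpu=6,node=3")

cpus_per_node = alloc["cpu"] / alloc["node"]       # average CPUs per node
gpus_per_node = alloc["gres/gpu"] / alloc["node"]  # average gpu units per node

print(cpus_per_node, gpus_per_node)
```

If gres/gpu is a job-wide total like cpu, the record would correspond to an average of two gpu units per node rather than six. That is just arithmetic on the quoted numbers and says nothing about how the units were actually distributed.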

Best regards
Jürgen


* Purwanto, Wirawan <wpurw...@odu.edu> [240117 15:54]:
> Hi,
> 
> In my HPC center, I found a SLURM job that was submitted with --gres=gpu:6 
> whereas the cluster has only four GPUs per node. It is a parallel job. 
> Here is a printout of some relevant fields:
> 
> AllocCPUS                                      30
> AllocGRES                                   gpu:6
> AllocTRES     billing=30,cpu=30,gres/gpu=6,node=3
> CPUTime                                1-01:23:00
> CPUTimeRAW                                  91380
> Elapsed                                  00:50:46
> JobID                                       20073
> JobIDRaw                                    20073
> JobName                               simple_cuda
> NCPUS                                          30
> NGPUS                                         6.0
> 
> What happened in this case? This job was asking for 3 nodes, 10 cores per 
> node. When the user specified “--gres=gpu:6”, does this mean six GPUs for the 
> entire job, or six GPUs per node? Per the description in 
> https://slurm.schedmd.com/gres.html#Running_Jobs, it says: gres is “Generic 
> resources required per node”. So it is illogical to request six GPUs per 
> node. So what happened? Did SLURM quietly ignore the request and grant just 
> one, or grant the max number (4)? Because apparently the job ran without 
> error.
> 
> Wirawan Purwanto
> Computational Scientist, HPC Group
> Information Technology Services
> Old Dominion University
> Norfolk, VA 23529
