Hi Wirawan,

in general `--gres=gpu:6` actually means six units of a generic resource named `gpu` per node. Each unit may or may not be associated with a physical GPU device.
I'd check the node configuration for the number of gres=gpu resource units that are configured for that node:

    scontrol show node <node>

Maybe your GPU devices are multi-instance GPUs (MIG), with each one being split into multiple separate GPU instances, and every gres=gpu unit counts against the total number of MIG instances rather than the number of physical GPU devices on the nodes?

Best regards
Jürgen

* Purwanto, Wirawan <wpurw...@odu.edu> [240117 15:54]:
> Hi,
>
> In my HPC center, I found a SLURM job that was submitted with --gres=gpu:6
> whereas the cluster has only four GPUs per node each. It is a parallel job.
> Here are some relevant field printout:
>
> AllocCPUS 30
> AllocGRES gpu:6
> AllocTRES billing=30,cpu=30,gres/gpu=6,node=3
> CPUTime 1-01:23:00
> CPUTimeRAW 91380
> Elapsed 00:50:46
> JobID 20073
> JobIDRaw 20073
> JobName simple_cuda
> NCPUS 30
> NGPUS 6.0
>
> What happened in this case? This job was asking for 3 nodes, 10 core per
> node. When the user specified “--gres=gpu:6”, does this mean six GPUs for the
> entire job, or six GPUs per node? Per the description in
> https://slurm.schedmd.com/gres.html#Running_Jobs, it says: gres is “Generic
> resources required per node”. So it is illogical to request six GPUs per
> node. So what happened? Did SLURM quietly ignore the request and grant just
> one, or grant the max number (4)? Because apparently the job ran without
> error.
>
> Wirawan Purwanto
> Computational Scientist, HPC Group
> Information Technology Services
> Old Dominion University
> Norfolk, VA 23529
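
For reference, the node check suggested in the reply could be scripted roughly as follows. This is only a sketch: it assumes the node description contains a `Gres=` field in forms such as `gpu:4` or `gpu:a100:4(S:0-1)`. The sample text, the node name, and the function name `count_gres_units` are made up for illustration and are not taken from the cluster in question.

```python
import re

# Hypothetical excerpt of `scontrol show node <node>` output. Real output has
# many more fields; the values here are invented for illustration only.
SAMPLE_NODE_INFO = """\
NodeName=node01 Arch=x86_64 CoresPerSocket=16
   Gres=gpu:a100:4(S:0-1)
   RealMemory=256000 State=IDLE
"""


def count_gres_units(node_info: str, name: str = "gpu") -> int:
    """Sum the configured units of one generic resource in a node description."""
    match = re.search(r"(?:^|\s)Gres=(\S+)", node_info)
    if match is None:
        return 0
    # Drop parenthesised annotations such as "(S:0-1)"; "(null)" becomes "".
    gres = re.sub(r"\([^)]*\)", "", match.group(1))
    total = 0
    for entry in gres.split(","):
        parts = entry.split(":")
        if parts[0] != name:
            continue
        # Assumed forms: "gpu", "gpu:4", "gpu:a100:4". No count means one unit.
        if len(parts) > 1 and parts[-1].isdigit():
            total += int(parts[-1])
        else:
            total += 1
    return total


requested_per_node = 6  # the value from --gres=gpu:6
configured = count_gres_units(SAMPLE_NODE_INFO)
print(f"requested per node: {requested_per_node}, configured: {configured}")
```

On a real system the text would come from running `scontrol show node <node>` rather than from the sample string. If the configured count turns out to be larger than the number of physical cards, that would be consistent with the MIG explanation above.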