Hi Rob,

Yes, those questions make sense. From what I understand, MIG should essentially split the GPU so that the instances behave as separate cards. Hence, two different users should be able to use two different MIG instances at the same time, and a single job should also be able to use all 14 instances. The result you observed suggests that MIG is a feature of the driver, i.e. lspci shows one device but nvidia-smi shows seven.
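
As a quick sanity check (I haven't tried this on an A100 myself, so treat it as a sketch), comparing the two views should make that visible:

    # PCI level: one physical A100 per bus address
    lspci | grep -i nvidia

    # Driver level: each parent GPU plus its MIG instances
    nvidia-smi -L

If the driver view lists the MIG instances as separate devices while lspci still only shows the two physical cards, that would match what you are seeing.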


I haven't played around with this myself in Slurm, but I would be interested to know the answers.


Laurence


On 15/11/2022 17:46, Groner, Rob wrote:
We have successfully used the nvidia-smi tool to take the 2 A100s in a node and split them into multiple MIG devices.  In one case, we split the 2 GPUs into 7 MIG devices each, so 14 total in that node, and in the other case, we split the 2 GPUs into 2 MIG devices each, so 4 total in the node.
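
(For context, the commands involved are roughly the following; the GPU indices and profile ID are illustrative and depend on the A100 variant. Profile 19 is typically the smallest 1g slice on an A100, and "nvidia-smi mig -lgip" lists the profiles available on a given card.

    # enable MIG mode on both GPUs (a GPU reset or reboot may be needed afterwards)
    nvidia-smi -i 0 -mig 1
    nvidia-smi -i 1 -mig 1

    # create seven small GPU instances per card, plus matching compute instances
    nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C
    nvidia-smi mig -i 1 -cgi 19,19,19,19,19,19,19 -C
)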

From our limited testing so far, and from the "sinfo" output, it appears that Slurm might be considering all of the MIG devices on the node to be in the same socket (even though the MIG devices come from two separate graphics cards in the node).  The sinfo output says (S:0) after the 14 devices are shown, indicating they're in socket 0.  That seems to be preventing 2 different users from using MIG devices at the same time.  Am I wrong that having 14 MIG gres devices show up in Slurm should mean that, in theory, 14 different users could each use one at the same time?
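
(To be concrete about what we're looking at, this is roughly the kind of check and configuration involved; the node name is a placeholder and our actual gres.conf may differ:

    # show the gres column per node; ours reports something like gpu:14(S:0)
    sinfo -N -o "%N %G"

    # gres.conf: let Slurm discover the MIG devices via NVML
    # (MIG autodetection needs a reasonably recent Slurm built against NVML)
    AutoDetect=nvml

    # slurm.conf node definition
    NodeName=gpunode01 Gres=gpu:14
)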

Even if that doesn't work... if I have 14 devices spread across 2 physical GPU cards, can one user utilize all 14 for a single job?  I would hope that Slurm would treat each of the MIG devices as its own separate card, which would mean 14 different jobs could run at the same time, each using its own particular MIG, right?
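
In other words, should both of these request patterns work on that node? (The script and program names are just placeholders.)

    # 14 separate jobs, each grabbing one MIG device
    srun --gres=gpu:1 ./my_app

    # a single job asking for all 14 MIG devices on the node
    sbatch --nodes=1 --gres=gpu:14 job.sh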

Do those questions make sense to anyone? 🙂

Rob
