Hi,

From what we observed, Slurm sees each MIG as a distinct gres/gpu, so you can have 14 jobs each using a different MIG. However (unless something has changed in the past year), due to NVIDIA limitations, a single process can't access more than one MIG simultaneously (this is unrelated to Slurm). So while a user can request a Slurm job with 2 GPUs (MIGs), they'll have to run two distinct processes within that job in order to utilize those two MIGs.
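As a rough sketch of that two-process pattern (the application name ./my_app is just a placeholder, and exact step flags can vary by Slurm version and gres.conf setup):

    #!/bin/bash
    #SBATCH --gres=gpu:2        # two MIG instances for the job
    #SBATCH --ntasks=2          # one task per MIG

    # Launch two independent processes, each restricted to one of the
    # allocated MIGs, and wait for both to finish.
    srun --ntasks=1 --gres=gpu:1 ./my_app input1 &
    srun --ntasks=1 --gres=gpu:1 ./my_app input2 &
    wait

Slurm should set CUDA_VISIBLE_DEVICES per step so that each process only sees its own MIG, which works around the one-MIG-per-process limitation.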
HTH,

On Tue, 15 Nov 2022 at 23:42, Laurence <laurence.fi...@cern.ch> wrote:
> Hi Rob,
>
> Yes, those questions make sense. From what I understand, MIG should
> essentially split the GPU so that they behave as separate cards. Hence two
> different users should be able to use two different MIG instances at the
> same time and also a single job could use all 14 instances. The result you
> observed suggests that MIG is a feature of the driver, i.e. lspci shows one
> device but nvidia-smi shows 7 devices.
>
> I haven't played around with this myself in Slurm but would be interested
> to know the answers.
>
> Laurence
>
> On 15/11/2022 17:46, Groner, Rob wrote:
>
> We have successfully used the nvidia-smi tool to take the 2 A100s in a
> node and split them into multiple GPU devices. In one case, we split the 2
> GPUs into 7 MIG devices each, so 14 in that node total, and in the other
> case, we split the 2 GPUs into 2 MIG devices each, so 4 total in the node.
>
> From our limited testing so far, and from the "sinfo" output, it appears
> that Slurm might be considering all of the MIG devices on the node to be in
> the same socket (even though the MIG devices come from two separate
> graphics cards in the node). The sinfo output says (S:0) after the 14
> devices are shown, indicating they're in socket 0. That seems to be
> preventing 2 different users from using MIG devices at the same time. Am I
> wrong that having 14 MIG gres devices show up in Slurm should mean that, in
> theory, 14 different users could use one at the same time?
>
> Even IF that doesn't work... if I have 14 devices spread across 2 physical
> GPU cards, can one user utilize all 14 for a single job? I would hope that
> Slurm would treat each of the MIG devices as its own separate card, which
> would mean 14 different jobs could run at the same time using their own
> particular MIG, right?
>
> Do those questions make sense to anyone? 🙂
>
> Rob

--
Yair Yarom | System Group (DevOps)
The Rachel and Selim Benin School
of Computer Science and Engineering
The Hebrew University of Jerusalem
T +972-2-5494522 | F +972-2-5494522
ir...@cs.huji.ac.il