We have successfully used the nvidia-smi tool to take the 2 A100s in a node
and split them into multiple MIG devices.  In one case, we split the 2 GPUs
into 7 MIG devices each, so 14 total in that node, and in the other case we
split the 2 GPUs into 2 MIG devices each, so 4 total in the node.
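
For reference, the splitting was done with commands roughly along these
lines (the profile IDs come from "nvidia-smi mig -lgip" and can differ
between cards, so treat this as a sketch rather than the exact recipe):

    # enable MIG mode on both GPUs (may require a GPU reset or reboot)
    nvidia-smi -i 0 -mig 1
    nvidia-smi -i 1 -mig 1

    # list the available GPU instance profiles and their IDs
    nvidia-smi mig -lgip

    # 7-way split: seven of the smallest (1g) instances per GPU,
    # creating the matching compute instances at the same time with -C
    nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C
    nvidia-smi mig -i 1 -cgi 19,19,19,19,19,19,19 -C

    # 2-way split on the other node: two 3g instances per GPU
    nvidia-smi mig -i 0 -cgi 9,9 -C
    nvidia-smi mig -i 1 -cgi 9,9 -C

    # confirm what got created
    nvidia-smi -L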

From our limited testing so far, and from the "sinfo" output, it appears that
Slurm might be considering all of the MIG devices on the node to be in the
same socket (even though the MIG devices come from two separate physical
cards in the node).  The sinfo output shows (S:0) after the 14 devices are
listed, indicating they are all in socket 0.  That seems to be preventing 2
different users from using MIG devices at the same time.  Am I wrong in
thinking that having 14 MIG gres devices show up in Slurm should mean that,
in theory, 14 different users could each use one at the same time?
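
In case it helps, this is how we have been looking at what Slurm thinks
it has (gpu-node01 is just a placeholder for the node name):

    # per-node GRES summary -- this is where the (S:0) shows up
    sinfo -N -o "%N %G"

    # the node's configured and detected GRES / socket info in more detail
    scontrol show node gpu-node01 | grep -i -E "gres|socket"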

Even if that doesn't work... if I have 14 devices spread across 2 physical
GPU cards, can one user utilize all 14 for a single job?  I would hope that
Slurm would treat each MIG device as its own separate card, which would mean
14 different jobs could run at the same time, each using its own particular
MIG device, right?
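
Concretely, what we are hoping is that submissions along these lines would
all work (job.sh is a placeholder, and the "1g.5gb" type name is just
whatever Slurm auto-detects for the MIG profile, so it may be spelled
differently on our system):

    # 14 separate jobs from 14 different users, each taking one MIG device
    sbatch --gres=gpu:1 job.sh

    # or, requesting a MIG instance by its GRES type name
    sbatch --gres=gpu:1g.5gb:1 job.sh

    # one job grabbing every MIG device on the node (with the understanding
    # that each of the 14 would be driven by its own process, since as far
    # as we know a single CUDA process can only use one MIG instance)
    sbatch --gres=gpu:14 job.sh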

Do those questions make sense to anyone?  🙂

Rob
