The problem appears to be the use of AutoDetect=nvml in the gres.conf file.  When we 
remove that and fully specify every MIG device (with help from the 
https://gitlab.com/nvidia/hpc/slurm-mig-discovery tool), I am able to submit a job 
allocating all of the MIG GPUs at once, or submit as many single-GPU jobs as there 
are MIG devices, without any of them going to pending (until all GPUs are actually in use).
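
For reference, the hand-written gres.conf ends up with one line per MIG device, roughly 
like the sketch below.  The device and nvidia-caps paths here are made up for illustration; 
the discovery tool prints the real ones for your node, and the Type names need to be 
consistent with the Gres= entry on the node line in slurm.conf.

# gres.conf -- no AutoDetect; paths below are illustrative only
Name=gpu Type=3g.20gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22
Name=gpu Type=3g.20gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap30,/dev/nvidia-caps/nvidia-cap31
Name=gpu Type=3g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap156,/dev/nvidia-caps/nvidia-cap157
Name=gpu Type=3g.20gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap165,/dev/nvidia-caps/nvidia-cap166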

Rob


________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Groner, 
Rob <rug...@psu.edu>
Sent: Thursday, November 17, 2022 10:08 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] NVIDIA MIG question

No, I can't submit more than 7 individual jobs and have them all run; the jobs 
after the first 7 go to pending until the first 7 finish.

And it's not a limit (at least, not a limit of 7), because here's the same problem 
on a node configured with 2x 3g.20gb per card (2 cards, so 4 MIG GPUs total in the node):

[rug262@testsch (RC) slurm] sinfo -o "%20N  %10c  %10m  %25f  %40G "
NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES
t-gc-1201             48          358400      3gc20gb                    gpu:nvidia_a100_3g.20gb:4(S:0)

So, there are 4 of them on that node


[rug262@testsch (RC) slurm] sbatch --gpus=1 --cpus-per-task=2 --partition=debug --nodelist=t-gc-1201 --wrap="sleep 100"

I submit 3 jobs, each asking for 1 GPU from that node:


[rug262@testsch (RC) slurm] squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              5049     debug     wrap   rug262 PD       0:00      1 (Resources)
              5048     debug     wrap   rug262  R       0:09      1 t-gc-1201
              5047     debug     wrap   rug262  R       0:31      1 t-gc-1201

The first 2 run fine, but any after that go to pending, even though there should 
be 4 GPUs available (according to the sinfo output).

Rob



________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Yair 
Yarom <ir...@cs.huji.ac.il>
Sent: Thursday, November 17, 2022 8:19 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] NVIDIA MIG question

Can you request more than 7 single-GPU jobs on the same node?
It could be that there's another limit you've encountered (e.g. memory or CPU), 
or some other limit in the account, partition, or QOS.
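If it helps, something along these lines usually shows which limit a pending job is 
actually hitting (the job id is just a placeholder):

squeue -j <jobid> -o "%i %t %r"    # job id, state, and pending reason
scontrol show job <jobid>          # full resource request and the Reason field
sacctmgr show qos                  # QOS-level TRES limits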

On our setup we limit jobs to 1 GPU per job (via a partition QOS); however, we can 
use up all of the MIGs with single-GPU jobs.
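(For completeness, that per-job cap is set with something like the following; the QOS 
and partition names here are just examples:)

sacctmgr modify qos gpu-jobs set MaxTRESPerJob=gres/gpu=1
# in slurm.conf, attached as the partition QOS:
PartitionName=gpu Nodes=... QOS=gpu-jobs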


On Wed, 16 Nov 2022 at 23:48, Groner, Rob <rug...@psu.edu> wrote:
That does help, thanks for the extra info.

If I have two separate GPU cards in the node, and I set up 7 MIGs on each card, 
for a total of 14 MIG "gpus" in the node... then SHOULD I be able to salloc 
requesting, say, 10 GPUs (7 from one card, 3 from the other)?  Because I can't.

I can request up to 7 just fine.  When I request more than that, it pulls in 
other nodes to try to satisfy the request, even though there are theoretically 14 on 
the one node.  When I ask for 8, it gives me 7 from t-gc-1202 and then 1 from 
t-gc-1201.  When I ask for 10, it fails because it can't give me 10 
without using both cards in one node.


[rug262@testsch ~ ]# sinfo -o "%20N  %10c  %10m  %25f  %50G "
NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES
t-gc-1201             48          358400      3gc20gb                    gpu:nvidia_a100_3g.20gb:4(S:0)
t-gc-1202             48          358400      1gc5gb                     gpu:nvidia_a100_1g.5gb:14(S:0)


[rug262@testsch (RC) ~] salloc --gpus=10 --account=1gc5gb --partition=sla-prio
salloc: Job allocation 5015 has been revoked.
salloc: error: Job submit/allocate failed: Requested node configuration is not available


Rob

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Yair Yarom <ir...@cs.huji.ac.il>
Sent: Wednesday, November 16, 2022 3:48 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] NVIDIA MIG question

Hi,

From what we observed, Slurm sees each MIG as a distinct gres/gpu, so you 
can have 14 jobs each using a different MIG.
However (unless something has changed in the past year), due to NVIDIA 
limitations, a single process can't access more than one MIG simultaneously 
(this is unrelated to Slurm).  So while a user can request a Slurm job 
with 2 GPUs (MIGs), they'll have to run two distinct processes within that job 
in order to utilize both MIGs.
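
A rough sketch of what that looks like as a batch script, assuming a reasonably recent 
Slurm (the two executables are placeholders for whatever the user actually runs):

#!/bin/bash
#SBATCH --gpus=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
# two concurrent steps, each bound to one of the job's two MIG instances
srun --exact --ntasks=1 --gpus=1 ./work_a &
srun --exact --ntasks=1 --gpus=1 ./work_b &
wait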

HTH,


On Tue, 15 Nov 2022 at 23:42, Laurence <laurence.fi...@cern.ch> wrote:

Hi Rob,


Yes, those questions make sense.  From what I understand, MIG essentially splits 
the GPU so that the instances behave as separate cards.  Hence two different users 
should be able to use two different MIG instances at the same time, and a 
single job could also use all 14 instances.  The result you observed suggests that 
MIG is a feature of the driver, i.e. lspci shows one device but nvidia-smi shows 
7 devices.


I haven't played around with this myself in slurm but would be interested to 
know the answers.


Laurence


On 15/11/2022 17:46, Groner, Rob wrote:
We have successfully used the nvidia-smi tool to take the 2 A100s in a node 
and split them into multiple GPU devices.  In one case, we split the 2 GPUs 
into 7 MIG devices each, so 14 in that node total, and in the other case, we 
split the 2 GPUs into 2 MIG devices each, so 4 total in the node.
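
(For anyone following along, this is the usual nvidia-smi MIG sequence, roughly like 
the lines below; the profile IDs are the A100 40GB ones and may differ on other cards, 
so check nvidia-smi mig -lgip first:)

nvidia-smi -i 0 -mig 1                             # enable MIG mode on GPU 0 (repeat for GPU 1)
nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C   # 7x 1g.5gb instances, plus compute instances
nvidia-smi mig -i 0 -cgi 9,9 -C                    # or, for the other node: 2x 3g.20gb per card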

From our limited testing so far, and from the "sinfo" output, it appears that 
slurm might be considering all of the MIG devices on the node to be in the same 
socket (even though the MIG devices come from two separate graphics cards in 
the node).  The sinfo output says (S:0) after the 14 devices are shown, 
indicating they're in socket 0.  That seems to be preventing 2 different users 
from using MIG devices at the same time.  Am I wrong that having 14 MIG gres 
devices show up in slurm should mean that, in theory, 14 different users could 
use one at the same time?

Even IF that doesn't work... if I have 14 devices spread across 2 physical GPU 
cards, can one user utilize all 14 for a single job?  I would hope that slurm 
would treat each of the MIG devices as its own separate card, which would mean 
14 different jobs could run at the same time, each using its own particular MIG, 
right?

Do those questions make sense to anyone?  🙂

Rob




--

  /|       |
  \/       | Yair Yarom | System Group (DevOps)
  []       | The Rachel and Selim Benin School
  [] /\    | of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //    \  | ir...@cs.huji.ac.il
 //        |



