Hi,

Regarding point 2: if you're using AccountingStorageTRES, I think you can list each gres/gpu:<type> to be tracked in addition to the generic gres/gpu, and then set "GrpTRES=gres/gpu=0" on all accounts, so that users can't request the generic gres/gpu but only gres/gpu:<type>.
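
As a rough, untested sketch of what that might look like: the type names (a100, gv100) are placeholders that would have to match the Type= values in your gres.conf, and "mylab" is a hypothetical account name.

```
# slurm.conf: track the generic GPU TRES plus one TRES per GPU type
# (a100 and gv100 are assumed type names, not taken from your setup)
AccountingStorageTRES=gres/gpu,gres/gpu:a100,gres/gpu:gv100

# Then, for every account, cap the generic GPU TRES at zero
# ("mylab" is a hypothetical account name):
#   sacctmgr modify account where name=mylab set GrpTRES=gres/gpu=0
```

Whether the limit on gres/gpu leaves typed requests such as --gpus=a100:1 unaffected is exactly the part that would need testing.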
We haven't tried this, but it's been on our todo list for a while now. So I'd like to know if it works :)

On Wed, 29 Mar 2023 at 21:31, <collin.m.mccar...@gmail.com> wrote:

> Hello,
>
> Apologies if this is in the docs but I couldn’t find it anywhere.
>
> I’ve been using Slurm to run a small 7-node cluster in a research lab for a couple of years now (I’m a PhD student). A couple of our nodes have heterogeneous GPU models. One in particular has quite a few: 2x NVIDIA A100s, 1x NVIDIA 3090, 2x NVIDIA GV100 w/ NVLink, 1x AMD MI100, 2x AMD MI200. This makes things a bit challenging but I need to work with what I have.
>
> 1. I’ve only been able to set this up previously on Slurm 20.02 by “ignoring” the AMDs and just specifying the NVIDIA GPUs. That worked when we had one or two people using the AMD GPUs and they could coordinate between themselves. But now, we have more people interested. I’m upgrading Slurm to 23.02 in hopes that might fix some of the challenges, but should this be possible? Ideally I would like to have AutoDetect=nvml and AutoDetect=rsmi both on. If it’s not I’ll shuffle GPUs around to make this node NVIDIA-only.
> 2. I want everyone to allocate GPUs with --gpus=<type>:<num> instead of --gpus=<num>, so they don’t “block” a nice GPU like an A100 when they really wanted any-old GPU on the machine like a GV100 or 3090. Can I force people to specify a GPU type and not just a count? This is especially important if I’m mixing AMDs and NVIDIAs on the same node. If not, can I specify the “order” in which I want GPUs to be scheduled if they don’t specify a type (so they get handed out from least-powerful to most-powerful if people don’t care)?
>
> Any help and/or advice here is much appreciated. Slurm has been amazing for our lab (albeit challenging to set up at first) and I want to get everything dialed before I graduate :D .
>
> Thanks,
> -Collin

--
Yair Yarom | System Group (DevOps)
The Rachel and Selim Benin School of Computer Science and Engineering
The Hebrew University of Jerusalem
T +972-2-5494522 | F +972-2-5494522
ir...@cs.huji.ac.il