Hi all! I've successfully managed to configure Slurm with one head node and two different compute nodes: one using "old" consumer RTX cards, and a new one using 4x A100 GPUs (80 GB version). I am now trying to set up a hybrid MIG configuration, where devices 0 and 1 are kept as they are, while 2 and 3 are each split into two 3g.40gb MIG instances.
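For reference, the split itself was done roughly like this (the 3g.40gb profile ID, 9 here, comes from nvidia-smi mig -lgip and may differ on other setups):

    # enable MIG mode on GPUs 2 and 3 only, leave 0 and 1 untouched
    nvidia-smi -i 2,3 -mig 1
    # list the GPU instance profiles and their numeric IDs
    nvidia-smi mig -lgip
    # create two 3g.40gb GPU instances (with default compute instances) on each of GPUs 2 and 3
    nvidia-smi mig -i 2 -cgi 9,9 -C
    nvidia-smi mig -i 3 -cgi 9,9 -C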
MIG itself works well: I am able to keep MIG disabled on 0 and 1, and enabled on 2 and 3 with two 40 GB instances each. Trying to configure Slurm on top of this has me lost: I have tried countless variations, but not a single one has worked so far. Here's what I have at the moment:

- My gres.conf has gone from the full list to literally just "AutoDetect=nvml", and slurmd -G returns a somewhat reasonable output:

slurmd: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
slurmd: Gres Name=gpu Type=a100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=24-31 CoreCnt=128 Links=-1,4,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=283 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap282,/dev/nvidia-caps/nvidia-cap283 Cores=56-63 CoreCnt=128 Links=-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=418 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap417,/dev/nvidia-caps/nvidia-cap418 Cores=40-47 CoreCnt=128 Links=-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=8-15 CoreCnt=128 Links=4,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=292 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292 Cores=56-63 CoreCnt=128 Links=0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100_3g.39gb Count=1 Index=427 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap426,/dev/nvidia-caps/nvidia-cap427 Cores=40-47 CoreCnt=128 Links=0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML

And here I have the first doubt: *the MIG profile is supposed to be called 3g.40gb, so why is it popping up as 3g.39gb?*

- My slurm.conf is very similar to the documentation example, with: Gres=gpu:a100:2,gpu:a100_3g.39gb:4

- I restarted *slurmctld* and *slurmd* on the node, and everything appears to be working. But when I send an *srun* command, weird stuff happens:
  - srun --gres=gpu:a100:2 returns a non-MIG device AND a MIG device together
  - sinfo only shows the two plain A100 GPUs, "*gpu:a100:2(S:1)*", or the node reports a GPU count that is too low (0 < 4) for the MIG devices and stays in the drain state
  - the fully qualified name "gpu:a100_3g.39gb:1" returns "Unable to allocate resources: Requested node configuration is not available"

*Where do I start to fix this mess?*

Thank you for your patience!

Cheers,
Edoardo
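P.S. For completeness, this is roughly how the relevant pieces look right now; the node name and CPU count below are placeholders, the gres.conf contents and the Gres string are exact:

    # gres.conf on the A100 node
    AutoDetect=nvml

    # slurm.conf, node definition (following the documentation example)
    GresTypes=gpu
    NodeName=gpu-a100 CPUs=128 Gres=gpu:a100:2,gpu:a100_3g.39gb:4 State=UNKNOWN

    # requests I have been testing (the payload is just an example)
    srun --gres=gpu:a100:2 nvidia-smi -L
    srun --gres=gpu:a100_3g.39gb:1 nvidia-smi -L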