No problem! Glad it is working for you now.

Best,

-Sean

On Thu, Oct 27, 2022 at 1:46 PM Dominik Baack <
dominik.ba...@cs.uni-dortmund.de> wrote:

> Thank you very much!
>
> Those were the missing settings!
>
> I am not sure how I overlooked them for nearly two days, but I am happy
> that it's working now.
>
> Cheers
> Dominik Baack
>
>
> On 27.10.2022 at 19:23, Sean Maxwell wrote:
>
> It looks like you are missing some of the slurm.conf entries related to
> enforcing the cgroup restrictions. I would go through the list here and
> verify/adjust your configuration:
>
> https://slurm.schedmd.com/cgroup.conf.html#OPT_/etc/slurm/slurm.conf
>
> Best,
>
> -Sean
>
>
>
> On Thu, Oct 27, 2022 at 1:04 PM Dominik Baack <
> dominik.ba...@cs.uni-dortmund.de> wrote:
>
>> Hi,
>>
>> yes, ConstrainDevices is set:
>>
>> ###
>> # Slurm cgroup support configuration file
>> ###
>> CgroupAutomount=yes
>> #
>> #CgroupMountpoint="/sys/fs/cgroup"
>> ConstrainCores=yes
>> ConstrainDevices=yes
>> ConstrainRAMSpace=yes
>> #
>> #
>>
>> I attached the slurm configuration file as well.
>>
>> Cheers
>> Dominik
>>
>> On 27.10.2022 at 17:57, Sean Maxwell wrote:
>>
>> Hi Dominik,
>>
>> Do you have ConstrainDevices=yes set in your cgroup.conf?
>>
>> Best,
>>
>> -Sean
>>
>> On Thu, Oct 27, 2022 at 11:49 AM Dominik Baack <
>> dominik.ba...@cs.uni-dortmund.de> wrote:
>>
>>> Hi,
>>>
>>> We are in the process of setting up SLURM on some DGX A100 nodes. We
>>> are experiencing the problem that all GPUs are available to users, even
>>> for jobs where only one should be assigned.
>>>
>>> It seems the requirement is forwarded correctly to the node: at least
>>> CUDA_VISIBLE_DEVICES is set to the correct ID, but it appears to be
>>> ignored by the rest of the system.
>>>
>>> Cheers
>>> Dominik Baack
>>>
>>> Example:
>>>
>>> baack@gwkilab:~$ srun --gpus=1 nvidia-smi
>>> Thu Oct 27 17:39:04 2022
>>> +-----------------------------------------------------------------------------+
>>> | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
>>> |-------------------------------+----------------------+----------------------+
>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>> |                               |                      |               MIG M. |
>>> |===============================+======================+======================|
>>> |   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
>>> | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>> |   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
>>> | N/A   28C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>> |   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
>>> | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>> |   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
>>> | N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>> |   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
>>> | N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>> |   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
>>> | N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>> |   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
>>> | N/A   31C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>> |   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
>>> | N/A   32C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>>
>>> +-----------------------------------------------------------------------------+
>>> | Processes:                                                                  |
>>> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
>>> |        ID   ID                                                   Usage      |
>>> |=============================================================================|
>>> |  No running processes found                                                 |
>>> +-----------------------------------------------------------------------------+
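
The thread does not record which slurm.conf entries were actually missing. As a rough sketch of how one might check a configuration against the linked list, the Python below scans slurm.conf text for two settings that, to my understanding, the cgroup.conf documentation names for enabling the cgroup plugins: ProctrackType=proctrack/cgroup and a TaskPlugin list containing task/cgroup. The `EXPECTED` table and both function names are illustrative assumptions, not part of Slurm, and the list may be incomplete for a given Slurm version.

```python
# Settings ASSUMED (not stated in the thread) to be needed in slurm.conf
# so that the cgroup.conf constraints are actually enforced.
EXPECTED = {
    "proctracktype": "proctrack/cgroup",
    "taskplugin": "task/cgroup",
}


def parse_conf(text):
    """Return {lowercased key: value} for simple Key=Value lines.

    Everything after a '#' is treated as a comment. Only the first
    Key=Value pair on a line is considered.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def missing_cgroup_settings(text):
    """List the expected entries that are absent from the given slurm.conf text."""
    settings = parse_conf(text)
    missing = []
    for key, needed in EXPECTED.items():
        # Values such as TaskPlugin may be comma-separated lists.
        values = [v.strip().lower() for v in settings.get(key, "").split(",")]
        if needed not in values:
            missing.append(f"{key} should include {needed}")
    return missing
```

For example, `missing_cgroup_settings(open("/etc/slurm/slurm.conf").read())` returns an empty list when both entries are present. The parser only looks at the first Key=Value pair on each line, which is enough for these two options but not for multi-key lines such as node definitions.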