Thank you very much!
Those were the missing settings!
I am not sure how I overlooked them for nearly two days, but I am happy
that it's working now.
Cheers
Dominik Baack
On 27.10.2022 at 19:23, Sean Maxwell wrote:
It looks like you are missing some of the slurm.conf entries related
to enforcing the cgroup restrictions. I would go through the list here
and verify/adjust your configuration:
https://slurm.schedmd.com/cgroup.conf.html#OPT_/etc/slurm/slurm.conf
Best,
-Sean
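
For reference, the slurm.conf entries described in the linked section look
roughly like the excerpt below. This is a sketch based on the cgroup.conf
documentation, not on the configuration file attached to this thread.
GresTypes=gpu and the gres.conf note are assumptions about how the GPUs
are defined on this cluster.

```
# slurm.conf (excerpt): plugins that enforce the limits set in cgroup.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
# Optional: only needed for gathering accounting metrics via cgroups
JobAcctGatherType=jobacct_gather/cgroup
PrologFlags=Contain
# Assumption: GPUs are scheduled as generic resources (GRES)
GresTypes=gpu
```

ConstrainDevices=yes in cgroup.conf only takes effect when the task/cgroup
plugin is active, and it relies on the GPU device files being known to
Slurm through gres.conf. The Slurm daemons typically need a restart after
these entries are changed.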
On Thu, Oct 27, 2022 at 1:04 PM Dominik Baack
<dominik.ba...@cs.uni-dortmund.de> wrote:
Hi,
Yes, ConstrainDevices is set:
###
# Slurm cgroup support configuration file
###
CgroupAutomount=yes
#
#CgroupMountpoint="/sys/fs/cgroup"
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
#
#
I have attached the Slurm configuration file as well.
Cheers
Dominik
On 27.10.2022 at 17:57, Sean Maxwell wrote:
Hi Dominik,
Do you have ConstrainDevices=yes set in your cgroup.conf?
Best,
-Sean
On Thu, Oct 27, 2022 at 11:49 AM Dominik Baack
<dominik.ba...@cs.uni-dortmund.de> wrote:
Hi,
We are in the process of setting up SLURM on some DGX A100 nodes. We are
experiencing the problem that all GPUs are available to users, even for
jobs where only one should be assigned.
It seems the request is forwarded correctly to the node: at least
CUDA_VISIBLE_DEVICES is set to the correct ID, but it is ignored by the
rest of the system.
Cheers
Dominik Baack
Example:
baack@gwkilab:~$ srun --gpus=1 nvidia-smi
Thu Oct 27 17:39:04 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   28C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   31C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   32C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+