Thank you very much!

Those were the missing settings!

I am not sure how I overlooked it for nearly two days, but I am happy that it's working now.

Cheers
Dominik Baack


On 27.10.2022 at 19:23, Sean Maxwell wrote:
It looks like you are missing some of the slurm.conf entries related to enforcing the cgroup restrictions. I would go through the list here and verify/adjust your configuration:

https://slurm.schedmd.com/cgroup.conf.html#OPT_/etc/slurm/slurm.conf
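
As a rough illustration only (not taken from the attached configuration), the slurm.conf entries that page describes for cgroup enforcement usually look something like the sketch below. The GresTypes line and the pairing of task/affinity with task/cgroup are assumptions about a typical GPU setup, so check each entry against the documentation for your Slurm version:

```
# slurm.conf (illustrative sketch, adjust to your site)
ProctrackType=proctrack/cgroup            # track job processes via cgroups
TaskPlugin=task/cgroup,task/affinity      # task/cgroup applies the Constrain* settings from cgroup.conf
JobAcctGatherType=jobacct_gather/cgroup   # optional, only for gathering metrics
GresTypes=gpu                             # assumption: GPUs are scheduled as a generic resource
```

Without task/cgroup in TaskPlugin, ConstrainDevices=yes in cgroup.conf is not enforced, which would match the symptom of CUDA_VISIBLE_DEVICES being set while all GPUs remain accessible.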

Best,

-Sean



On Thu, Oct 27, 2022 at 1:04 PM Dominik Baack <dominik.ba...@cs.uni-dortmund.de> wrote:

    Hi,

    yes, ConstrainDevices is set:

    ###
    # Slurm cgroup support configuration file
    ###
    CgroupAutomount=yes
    #
    #CgroupMountpoint="/sys/fs/cgroup"
    ConstrainCores=yes
    ConstrainDevices=yes
    ConstrainRAMSpace=yes
    #
    #

    I attached the slurm configuration file as well

    Cheers
    Dominik

    On 27.10.2022 at 17:57, Sean Maxwell wrote:
    Hi Dominik,

    Do you have ConstrainDevices=yes set in your cgroup.conf?

    Best,

    -Sean

    On Thu, Oct 27, 2022 at 11:49 AM Dominik Baack
    <dominik.ba...@cs.uni-dortmund.de> wrote:

        Hi,

        We are in the process of setting up SLURM on some DGX A100 nodes.
        We are experiencing the problem that all GPUs are available to
        users, even for jobs where only one should be assigned.

        It seems the requirement is forwarded correctly to the node; at
        least CUDA_VISIBLE_DEVICES is set to the correct id, but it is
        ignored by the rest of the system.

        Cheers
        Dominik Baack

        Example:

        baack@gwkilab:~$ srun --gpus=1 nvidia-smi
        Thu Oct 27 17:39:04 2022
        +-----------------------------------------------------------------------------+
        | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
        |-------------------------------+----------------------+----------------------+
        | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
        | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
        |                               |                      |               MIG M. |
        |===============================+======================+======================|
        |   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
        | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
        |                               |                      |             Disabled |
        +-------------------------------+----------------------+----------------------+
        |   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
        | N/A   28C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
        |                               |                      |             Disabled |
        +-------------------------------+----------------------+----------------------+
        |   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
        | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
        |                               |                      |             Disabled |
        +-------------------------------+----------------------+----------------------+
        |   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
        | N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
        |                               |                      |             Disabled |
        +-------------------------------+----------------------+----------------------+
        |   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
        | N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
        |                               |                      |             Disabled |
        +-------------------------------+----------------------+----------------------+
        |   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
        | N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
        |                               |                      |             Disabled |
        +-------------------------------+----------------------+----------------------+
        |   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
        | N/A   31C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
        |                               |                      |             Disabled |
        +-------------------------------+----------------------+----------------------+
        |   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
        | N/A   32C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
        |                               |                      |             Disabled |
        +-------------------------------+----------------------+----------------------+

        +-----------------------------------------------------------------------------+
        | Processes:                                                                  |
        |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
        |        ID   ID                                                   Usage      |
        |=============================================================================|
        |  No running processes found                                                 |
        +-----------------------------------------------------------------------------+
