You need to add

ConstrainDevices=yes

to your cgroup.conf on the compute nodes and restart slurmd there. This is the
setting that restricts a job to only the GRES it actually requested.
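For reference, with that one line added, the cgroup.conf from your message below
would look roughly like this (a minimal sketch; keep any other settings you
already have):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes

After restarting slurmd, a job that requests e.g. --gres=gpu:2 should only see
those two devices.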

Sean

________________________________
From: lyz--- via slurm-users <slurm-users@lists.schedmd.com>
Sent: Tuesday, April 15, 2025 8:29:41 PM
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [EXT] [slurm-users] Re: Issue with Enforcing GPU Usage Limits in Slurm

Hi, Christopher. Thank you for your reply.

I have already modified the cgroup.conf configuration file in Slurm as follows:

vim /etc/slurm/cgroup.conf
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#
CgroupAutomount=yes

ConstrainCores=yes
ConstrainRAMSpace=yes

Then I edited slurm.conf:

vim /etc/slurm/slurm.conf
PrologFlags=CONTAIN
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

I restarted both the slurmctld service on the head node and the slurmd service
on the compute nodes.
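(Concretely, assuming the usual systemd units; adjust if your services are
managed differently:

systemctl restart slurmctld   # on the head node
systemctl restart slurmd      # on each compute node
)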

I also set resource limits for the user:
[root@head1 ~]# sacctmgr show assoc format=cluster,account%35,user%35,partition,maxtres%35,GrpCPUs,GrpMem
   Cluster    Account    User  Partition     MaxTRES  GrpCPUs  GrpMem
---------- ---------- ------- ---------- ----------- -------- -------
   cluster        lyz
   cluster        lyz     lyz                 gpus=2       80
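(For reference, limits like these are typically set with something along the
lines of the following; the exact invocation may differ:

sacctmgr modify user lyz set MaxTRES=gres/gpu=2 GrpTRES=cpu=80

Note that this only caps what a job may request; it does not by itself hide
unrequested GPUs from a running job.)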

However, when I specify CUDA device numbers in my .py script, for example:

import os
import time

import torch

# Expose GPUs 0-3 to this process, regardless of what Slurm allocated.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

def test_gpu():
    if torch.cuda.is_available():
        # Deliberately select a device beyond the job's 2 allocated GPUs
        # (index 4 would be out of range with only four devices visible).
        torch.cuda.set_device(3)
        print("CUDA is available. PyTorch can use GPU.")

        num_gpus = torch.cuda.device_count()
        print(f"Number of GPUs available: {num_gpus}")

        current_device = torch.cuda.current_device()
        print(f"Current GPU device: {current_device}")

        device_name = torch.cuda.get_device_name(current_device)
        print(f"Name of the current GPU device: {device_name}")

        x = torch.rand(5, 5).cuda()
        print("Random tensor on GPU:")
        print(x)
    else:
        print("CUDA is not available. PyTorch will use CPU.")
    time.sleep(1000)

if __name__ == "__main__":
    test_gpu()

When I run this script, it can still see and use all of the node's GPUs,
bypassing the resource restrictions that cgroup was supposed to enforce.
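A minimal check of what the job actually sees, run inside the allocation (for
example via srun --gres=gpu:2 python3 check_visible.py; the file name is just an
example), is:

import os
import torch

# Report what Slurm/cgroup exposes to this job.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("Visible GPUs:", torch.cuda.device_count())

Without ConstrainDevices=yes this reports every GPU on the node, even though the
association only allows gpus=2.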

Are there any other ways to solve this problem?

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
