You need to add ConstrainDevices=yes to your cgroup.conf and restart slurmd on your nodes. This is the setting that gives a job access to only the GRES it requests.
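For reference, a minimal sketch of what that file looks like with the extra line added (the other settings are the ones already shown in the quoted message below; adjust for your site):

# /etc/slurm/cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
# Restrict each job's device cgroup to the GRES it was allocated
ConstrainDevices=yes

After restarting slurmd, one quick sanity check is to request a single GPU and list what the job can actually see, e.g. srun --gres=gpu:1 nvidia-smi -L, which should report only the allocated device.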
Sean

________________________________
From: lyz--- via slurm-users <slurm-users@lists.schedmd.com>
Sent: Tuesday, April 15, 2025 8:29:41 PM
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [EXT] [slurm-users] Re: Issue with Enforcing GPU Usage Limits in Slurm

External email: Please exercise caution

Hi, Christopher. Thank you for your reply.

I have already modified the cgroup.conf configuration file in Slurm as follows:

vim /etc/slurm/cgroup.conf

#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes

Then I edited slurm.conf:

vim /etc/slurm/slurm.conf

PrologFlags=CONTAIN
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

I restarted both the slurmctld service on the head node and the slurmd service on the compute nodes. I also set resource limits for the user:

[root@head1 ~]# sacctmgr show assoc format=cluster,account%35,user%35,partition,maxtres%35,GrpCPUs,GrpMem
   Cluster    Account       User  Partition      MaxTRES  GrpCPUs  GrpMem
---------- ---------- ---------- ---------- ------------ -------- -------
   cluster        lyz
   cluster        lyz        lyz                  gpus=2       80

However, when I specify CUDA device numbers in my .py script, for example:

import os
import time

import torch

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

def test_gpu():
    if torch.cuda.is_available():
        torch.cuda.set_device(4)
        print("CUDA is available. PyTorch can use GPU.")
        num_gpus = torch.cuda.device_count()
        print(f"Number of GPUs available: {num_gpus}")
        current_device = torch.cuda.current_device()
        print(f"Current GPU device: {current_device}")
        device_name = torch.cuda.get_device_name(current_device)
        print(f"Name of the current GPU device: {device_name}")
        x = torch.rand(5, 5).cuda()
        print("Random tensor on GPU:")
        print(x)
    else:
        print("CUDA is not available. PyTorch will use CPU.")
    time.sleep(1000)

if __name__ == "__main__":
    test_gpu()

When I run this script, it still bypasses the GPU restrictions that the cgroup settings are supposed to enforce. Are there any other ways to solve this problem?
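For testing the ConstrainDevices change against a script like the one above, a small illustrative check along these lines can be used (the file name check_devices.py and the srun line are only examples, not part of the original setup); it reports from inside a job which GPU ordinals are actually usable, even when the script widens CUDA_VISIBLE_DEVICES:

import os

# Deliberately claim more devices than the job requested, as the script above does.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

import torch  # imported after setting the variable, before any CUDA call

# With ConstrainDevices=yes, the device cgroup should still block any GPU that
# was not allocated to the job, no matter what the variable claims.
for i in range(torch.cuda.device_count()):
    try:
        torch.ones(1, device=f"cuda:{i}")
        print(f"cuda:{i} usable: {torch.cuda.get_device_name(i)}")
    except RuntimeError as err:
        print(f"cuda:{i} blocked: {err}")

Run it under the scheduler, for example srun --gres=gpu:1 python check_devices.py; with the device cgroup active, only the allocated GPU(s) should come back as usable.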
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com