Hi everyone,

I'm in charge of the new GPU cluster in my lab.
I'm using cgroups to restrict access to resources, in particular the GPUs, and it works fine when users go through the connection created by Slurm. I'm using the pam_slurm_adopt.so module to give SSH access to a node when the user already has a job running on it. The problem: when connecting to the node through SSH, the user can see and use all the GPUs of the node, even if they only asked for one. This is really problematic, as most users work on the cluster by connecting their IDE to it over SSH. I can't find any related resources on the web or in the old list archives; do you have any idea what I am missing? I'm not an expert, I've only been working in system administration for 5 months...

Below the config files I've added a sketch of the PAM setup and an example session showing the behaviour.

Thanks in advance,
Guillaume

Notes: I'm running Slurm 21.08.5. The gateway (slurmctld) runs Ubuntu 22.04 and the nodes run Fedora 36.

Here is my slurm.conf:

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=gpucluster
SlurmctldHost=gpu-gw
# MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
PrologFlags=contain
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
KillWait=30
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
# PRIORITY
# Activate the multifactor priority plugin
PriorityType=priority/multifactor
# Reset usage after 1 week
PriorityUsageResetPeriod=WEEKLY
# Apply no decay
PriorityDecayHalfLife=0
# The smaller the job, the greater its job size priority.
PriorityFavorSmall=YES
# The job's age factor reaches 1.0 after waiting in the queue for a week.
PriorityMaxAge=7-0
# This next group determines the weighting of each of the
# components of the Multifactor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0   # don't use the QOS factor
#
#
# MEMORY MAX AND DEFAULT VALUES (MB)
DefCpuPerGPU=2
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu
AccountingStoreFlags=job_comment
AccountingStoragePort=6819
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
#
# PREEMPTION
PreemptMode=Requeue
PreemptType=preempt/qos
#
# COMPUTE NODES
GresTypes=gpu
NodeName=node0[1-7] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=386683 Gres=gpu:3 Features="nvidia,ampere,A100,pcie" State=UNKNOWN
NodeName=nodemm01 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=16000 Gres=gpu:4 Features="nvidia,GeForce,RTX3090" State=UNKNOWN
PartitionName=ids Nodes=node0[1-7] Default=YES MaxTime=INFINITE AllowQos=normal,default,preempt State=UP
PartitionName=mm Nodes=nodemm01 MaxTime=INFINITE State=UP

Here is my cgroup.conf:

###
# Slurm cgroup support configuration file.
###
CgroupAutomount=yes
#CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
#ConstrainKmemSpace=no   # avoid known kernel issues
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
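In case it helps, here is roughly how pam_slurm_adopt is wired in on the compute nodes. This follows the upstream documentation rather than being a copy of my exact files, so take it as a sketch (the option value shown is, as far as I know, the documented default):

# /etc/pam.d/sshd (on each compute node)
# pam_slurm_adopt should be the last "account" entry; it adopts the
# incoming sshd process into an existing job of that user on the node,
# or denies the connection if there is none.
account    required     pam_slurm_adopt.so action_unknown=newest

# /etc/ssh/sshd_config must also have PAM enabled:
UsePAM yes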
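And this is the behaviour itself, on one of the A100 nodes (node01 as an example; the job requests a single GPU out of the node's three):

$ srun --gres=gpu:1 --pty bash    # shell created by Slurm
$ nvidia-smi -L                   # lists exactly 1 GPU, as expected

$ ssh node01                      # second shell, adopted by pam_slurm_adopt
$ nvidia-smi -L                   # lists all 3 GPUs: this is the problem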
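If it is useful for diagnosis, I believe the device cgroup of each shell can be compared like this (paths assuming the cgroup v1 devices controller, which is what the 21.08 cgroup plugin uses as far as I understand; they may look different on the nodes):

# In both shells: which cgroups is this process in?
$ cat /proc/self/cgroup

# In the srun shell: what does the job's devices cgroup allow?
$ cat /sys/fs/cgroup/devices/slurm/uid_${UID}/job_${SLURM_JOB_ID}/devices.list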
--
Guillaume LECHANTRE
Research & Development Engineer, Télécom Paris
19 place Marguerite Perey, CS 20031, 91123 Palaiseau Cedex
https://www.telecom-paris.fr/