Hi everyone,

I'm in charge of the new GPU cluster in my lab.
I'm using cgroups to restrict access to resources, in particular the GPUs, and it works fine when users go through the connection created by Slurm. I'm using the pam_slurm_adopt.so module to give SSH access to a node when the user already has a job running on it. The problem: when connecting to the node through SSH, the user can see and use all the GPUs of the node, even if they only asked for one. This is really problematic, as most users work on the cluster by connecting their IDE to it over SSH. I can't find any related resources on the web or in the old list archives; do you have any idea what I am missing? I'm not an expert, I've only been working in system administration for 5 months...

Below the config files I've added a sketch of the PAM setup and an example session showing the behaviour.

Thanks in advance,
Guillaume

Notes: I'm running Slurm 21.08.5. The gateway (slurmctld) runs Ubuntu 22.04 and the nodes run Fedora 36.

Here is my slurm.conf:

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=gpucluster
SlurmctldHost=gpu-gw
# MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
PrologFlags=contain
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
KillWait=30
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
# PRIORITY
# Activate the multifactor priority plugin
PriorityType=priority/multifactor
# Reset usage after 1 week
PriorityUsageResetPeriod=WEEKLY
# Apply no decay
PriorityDecayHalfLife=0
# The smaller the job, the greater its job size priority.
PriorityFavorSmall=YES
# The job's age factor reaches 1.0 after waiting in the queue for a week.
PriorityMaxAge=7-0
# This next group determines the weighting of each of the
# components of the Multifactor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0   # don't use the QOS factor
#
#
# MEMORY MAX AND DEFAULT VALUES (MB)
DefCpuPerGPU=2
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu
AccountingStoreFlags=job_comment
AccountingStoragePort=6819
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
#
# PREEMPTION
PreemptMode=Requeue
PreemptType=preempt/qos
#
# COMPUTE NODES
GresTypes=gpu
NodeName=node0[1-7] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=386683 Gres=gpu:3 Features="nvidia,ampere,A100,pcie" State=UNKNOWN
NodeName=nodemm01 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=16000 Gres=gpu:4 Features="nvidia,GeForce,RTX3090" State=UNKNOWN
PartitionName=ids Nodes=node0[1-7] Default=YES MaxTime=INFINITE AllowQos=normal,default,preempt State=UP
PartitionName=mm Nodes=nodemm01 MaxTime=INFINITE State=UP

Here is my cgroup.conf:

###
# Slurm cgroup support configuration file.
###
CgroupAutomount=yes
#CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
#ConstrainKmemSpace=no   # avoid known kernel issues
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
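In case it helps, here is roughly how pam_slurm_adopt is wired in on the compute nodes. This follows the upstream documentation rather than being a copy of my exact files, so take it as a sketch (the option value shown is, as far as I know, the documented default):

# /etc/pam.d/sshd (on each compute node)
# pam_slurm_adopt should be the last "account" entry; it adopts the
# incoming sshd process into an existing job of that user on the node,
# or denies the connection if there is none.
account    required     pam_slurm_adopt.so action_unknown=newest

# /etc/ssh/sshd_config must also have PAM enabled:
UsePAM yes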
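And this is the behaviour itself, on one of the A100 nodes (node01 as an example; the job requests a single GPU out of the node's three):

$ srun --gres=gpu:1 --pty bash    # shell created by Slurm
$ nvidia-smi -L                   # lists exactly 1 GPU, as expected

$ ssh node01                      # second shell, adopted by pam_slurm_adopt
$ nvidia-smi -L                   # lists all 3 GPUs: this is the problem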
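If it is useful for diagnosis, I believe the device cgroup of each shell can be compared like this (paths assuming the cgroup v1 devices controller, which is what the 21.08 cgroup plugin uses as far as I understand; they may look different on the nodes):

# In both shells: which cgroups is this process in?
$ cat /proc/self/cgroup

# In the srun shell: what does the job's devices cgroup allow?
$ cat /sys/fs/cgroup/devices/slurm/uid_${UID}/job_${SLURM_JOB_ID}/devices.list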
--
Guillaume LECHANTRE
Research & Development Engineer, Télécom Paris
19 place Marguerite Perey, CS 20031, 91123 Palaiseau Cedex
https://www.telecom-paris.fr/