cgroups can control access to device files (e.g. /dev/nvidia0); that is how I understand the GPU restriction to work.
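For example (a rough, untested sketch; the path below assumes cgroup v1 with the devices controller and Slurm's usual uid/job hierarchy, so adjust for your layout or for cgroup v2), running something like this inside an allocation should show only the granted card:

    srun --gres=gpu:1 nvidia-smi -L
    srun --gres=gpu:1 bash -c 'cat /sys/fs/cgroup/devices/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/devices.list'

If ConstrainDevices is working, the first command lists a single GPU, and the second shows the allowed device entries (by major:minor number), which should exclude the GPUs that were not allocated.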
-Sean

On Thu, Mar 24, 2022 at 4:27 AM <taleinterve...@sjtu.edu.cn> wrote:

> Well, this is indeed the point. We didn't set ConstrainDevices=yes in
> cgroup.conf. After adding it, the GPU restriction works as expected.
>
> But what is the relation between GPU restriction and cgroups? I had never
> heard that cgroups can limit GPU card usage. Isn't that a feature of CUDA
> or the NVIDIA driver?
>
> From: Sean Maxwell <s...@case.edu>
> Sent: March 23, 2022 23:05
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] how to locate the problem when slurm failed to restrict gpu usage of user jobs
>
> Hi,
>
> If you are using cgroups for task/process management, you should verify
> that your /etc/slurm/cgroup.conf has the following line:
>
> ConstrainDevices=yes
>
> I'm not sure about the missing environment variable, but the absence of
> the above line in cgroup.conf is one way the GPU devices can be left
> unconstrained in jobs.
>
> -Sean
>
> On Wed, Mar 23, 2022 at 10:46 AM <taleinterve...@sjtu.edu.cn> wrote:
>
> Hi, all:
>
> We found a problem: jobs submitted with an argument such as --gres=gpu:1
> are not restricted in their GPU usage; the user can still see all GPU
> cards on the allocated node.
>
> Our GPU nodes have 4 cards each, with the following gres.conf:
>
> cat /etc/slurm/gres.conf
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia0 CPUs=0-15
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia1 CPUs=16-31
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia2 CPUs=32-47
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia3 CPUs=48-63
>
> As a test, we submit a simple batch job like:
>
> #!/bin/bash
> #SBATCH --job-name=test
> #SBATCH --partition=a100
> #SBATCH --nodes=1
> #SBATCH --ntasks=6
> #SBATCH --gres=gpu:1
> #SBATCH --reservation="gpu test"
> hostname
> nvidia-smi
> echo end
>
> In the output file, nvidia-smi shows all 4 GPU cards, but we expect to see
> only the 1 allocated card.
>
> The official Slurm documentation says it sets the CUDA_VISIBLE_DEVICES
> environment variable to restrict which GPU cards are visible to the user,
> but we did not find such a variable in the job environment. We only
> confirmed that it exists in the prolog environment, by adding the debug
> command "echo $CUDA_VISIBLE_DEVICES" to the Slurm prolog script.
>
> So how does Slurm cooperate with the NVIDIA tools to make a job see only
> its allocated GPU cards? What is required of the NVIDIA GPU driver, CUDA
> toolkit, or any other component for Slurm to correctly restrict GPU usage?
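P.S. For reference, a minimal cgroup.conf enabling the device constraint could look roughly like the following (a sketch only; sites usually constrain cores and memory as well, and the constraints only take effect when slurm.conf selects the cgroup plugins, e.g. ProctrackType=proctrack/cgroup and TaskPlugin=task/cgroup):

    # /etc/slurm/cgroup.conf (sketch)
    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes

With that in place, slurmstepd sets up a devices cgroup for each job step and only whitelists the GPU device files that were allocated, which is why nvidia-smi inside the job stops seeing the other cards.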