[slurm-users] Re: how to locate the problem when slurm failed to restrict gpu usage of user jobs

taleintervenor Thu, 24 Mar 2022 01:29:22 -0700

Yes, that was indeed the problem. We had not set ConstrainDevices=yes in 
cgroup.conf. After adding it, GPU restriction works as expected.
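
One way to confirm the fix from inside a job is to count the GPUs that 
"nvidia-smi -L" reports and compare the number with the --gres request. The 
following is only a sketch: the helper name count_visible_gpus is made up for 
illustration, and it assumes nvidia-smi prints one line per GPU beginning with 
"GPU <index>:".

```shell
#!/bin/bash
# Hypothetical helper: count GPUs in `nvidia-smi -L` style output read on stdin.
# Assumes each GPU is reported on a line that starts with "GPU <index>:".
count_visible_gpus() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Inside a job step, report how many GPUs the job can actually see.
# The check is skipped on machines without nvidia-smi.
if command -v nvidia-smi >/dev/null 2>&1; then
    visible=$(nvidia-smi -L | count_visible_gpus)
    echo "GPUs visible to this job: ${visible}"
fi
```

With device constraining in effect, a job submitted with --gres=gpu:1 would be 
expected to report 1 here rather than 4.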


But what is the relationship between GPU restriction and cgroups? I had never 
heard that cgroups can limit GPU card usage. Isn't that a feature of CUDA or 
the NVIDIA driver? 

 

From: Sean Maxwell <s...@case.edu> 
Sent: March 23, 2022 23:05
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] how to locate the problem when slurm failed to restrict 
gpu usage of user jobs

 

Hi,

 

If you are using cgroups for task/process management, you should verify that 
your /etc/slurm/cgroup.conf has the following line:

 

ConstrainDevices=yes

 

I'm not sure about the missing environment variable, but the absence of the 
above in cgroup.conf is one way the GPU devices can be unconstrained in the 
jobs.
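
A quick way to check for that line on a compute node might look like the 
sketch below. The function name has_constrain_devices is hypothetical. The 
path is the one mentioned above and may differ on other installations. The 
option name and value are matched case-insensitively here, and commented-out 
lines are ignored.

```shell
#!/bin/bash
# Hypothetical helper: return 0 if the given cgroup.conf enables
# ConstrainDevices, 1 otherwise. Lines starting with '#' do not match.
has_constrain_devices() {
    grep -Eiq '^[[:space:]]*ConstrainDevices[[:space:]]*=[[:space:]]*yes([[:space:]]|#|$)' "$1"
}

# Path taken from the message above; adjust it for your installation.
conf=/etc/slurm/cgroup.conf
if [ -r "$conf" ]; then
    if has_constrain_devices "$conf"; then
        echo "ConstrainDevices is enabled in $conf"
    else
        echo "ConstrainDevices is NOT enabled in $conf"
    fi
fi
```

Since the setting only matters when cgroups handle task management, it may 
also be worth confirming the task plugin with 
"scontrol show config | grep -i TaskPlugin".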

 

-Sean

 

 

 

On Wed, Mar 23, 2022 at 10:46 AM <taleinterve...@sjtu.edu.cn> wrote:

Hi, all:

 

We found that Slurm jobs submitted with an argument such as --gres gpu:1 are 
not restricted in their GPU usage: the user can still see all GPU cards on the 
allocated nodes.

Our GPU node has 4 cards, and its gres.conf is:

> cat /etc/slurm/gres.conf

Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia0 CPUs=0-15

Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia1 CPUs=16-31

Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia2 CPUs=32-47

Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia3 CPUs=48-63

 

For testing, we submit a simple batch job like this:

#!/bin/bash

#SBATCH --job-name=test

#SBATCH --partition=a100

#SBATCH --nodes=1

#SBATCH --ntasks=6

#SBATCH --gres=gpu:1

#SBATCH --reservation="gpu test"

hostname

nvidia-smi

echo end

 

Then, in the output file, nvidia-smi showed all 4 GPU cards, but we expected 
to see only the 1 allocated GPU card.

 

The official Slurm documentation says that it sets the CUDA_VISIBLE_DEVICES 
environment variable to restrict the GPU cards available to the user. But we 
did not find this variable in the job environment. We only confirmed that it 
does exist in the prolog script environment, by adding the debug command 
“echo $CUDA_VISIBLE_DEVICES” to the Slurm prolog script.
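
The same kind of check can be made from inside the job itself, for example by 
placing something like the following before nvidia-smi in the batch script. 
This is a minimal sketch. The function name describe_gpu_env is made up for 
illustration, and it prints "<unset>" so that a missing variable can be told 
apart from an empty one.

```shell
#!/bin/bash
# Hypothetical helper: report whether CUDA_VISIBLE_DEVICES is present in the
# current environment. "<unset>" means the variable does not exist at all,
# which is different from it being set to an empty string.
describe_gpu_env() {
    if [ -n "${CUDA_VISIBLE_DEVICES+x}" ]; then
        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
    else
        echo "CUDA_VISIBLE_DEVICES=<unset>"
    fi
}

describe_gpu_env
```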

 

So how does Slurm cooperate with the NVIDIA tools to make the job user see 
only its allocated GPU card? What are the requirements on the NVIDIA GPU 
driver, the CUDA toolkit, or any other component for Slurm to correctly 
restrict GPU usage?
