What does your cgroup.conf look like on the GPU nodes? (I don't think
it's possible to make the GPUs properly not visible to jobs without
using cgroup device restrictions.)
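For reference, a cgroup.conf that actually hides unallocated devices
usually looks something like the minimal sketch below; ConstrainDevices=yes
is the important part, and slurm.conf also needs TaskPlugin=task/cgroup
(and ProctrackType=proctrack/cgroup) for it to take effect:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes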
Tina
On 23/03/2022 14:42, taleinterve...@sjtu.edu.cn wrote:
Hi, all:
We have found a problem: a Slurm job submitted with an argument such as
--gres gpu:1 is not restricted in its GPU usage; the user can still see
all GPU cards on the allocated node.
Our GPU nodes have 4 cards each, and their gres.conf is:
cat /etc/slurm/gres.conf
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia0 CPUs=0-15
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia1 CPUs=16-31
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia2 CPUs=32-47
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia3 CPUs=48-63
As a test, we submitted a simple batch job like this:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=a100
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --gres=gpu:1
#SBATCH --reservation="gpu test"
hostname
nvidia-smi
echo end
In the output file, nvidia-smi showed all 4 GPU cards, but we expected
to see only the 1 allocated card.
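For reference, we could extend the test script with a few debug lines
like the sketch below; nvidia-smi -L simply lists the GPUs the process
can see, and (depending on the Slurm version and configuration)
SLURM_JOB_GPUS may or may not be set:
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-unset}"
nvidia-smi -L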
The official Slurm documentation says it sets the CUDA_VISIBLE_DEVICES
environment variable to restrict the GPU cards available to the user,
but we did not find that variable in the job environment. We only
confirmed that it exists in the prolog environment, by adding the debug
command "echo $CUDA_VISIBLE_DEVICES" to our Slurm prolog script.
So how does Slurm cooperate with the NVIDIA tools to make a job's user
see only the allocated GPU card? What are the requirements on the NVIDIA
GPU driver, CUDA toolkit, or any other component for Slurm to correctly
restrict GPU usage?
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator
Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk