It should exist in the user environment as well.
I would check the user's .bashrc and .bash_profile settings to see if
they are doing anything that would change that.
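A quick way to check (the srun options below just mirror your test job; adjust as needed):

# anything in the startup files that sets, unsets, or overrides the variable?
grep -n CUDA_VISIBLE_DEVICES ~/.bashrc ~/.bash_profile

# what does a job step actually see inside an allocation?
srun --partition=a100 --gres=gpu:1 bash -c 'echo CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}'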
Brian Andrus
On 3/23/2022 7:42 AM, taleinterve...@sjtu.edu.cn wrote:
Hi, all:
We found a problem: a Slurm job submitted with an argument such as --gres gpu:1
is not restricted in its GPU usage; the user can still see all GPU cards
on the allocated node.
Our GPU node has 4 cards, with gres.conf as follows:
> cat /etc/slurm/gres.conf
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia0 CPUs=0-15
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia1 CPUs=16-31
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia2 CPUs=32-47
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia3 CPUs=48-63
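For completeness, the GRES the scheduler sees on the node can be checked with, for example (the node name is a placeholder):

scontrol show node <nodename> | grep -i Gres
sinfo -o "%N %G"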
For testing, we submitted a simple batch job like:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=a100
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --gres=gpu:1
#SBATCH --reservation="gpu test"
hostname
nvidia-smi
echo end
Then in the output file, nvidia-smi showed all 4 GPU cards, but we
expected to see only the 1 allocated card.
The official Slurm documentation says it sets the CUDA_VISIBLE_DEVICES
environment variable to restrict which GPU cards are visible to the user.
But we did not find such a variable in the job environment; we only
confirmed it exists in the prolog script environment, by adding the debug
command "echo $CUDA_VISIBLE_DEVICES" to our Slurm prolog script.
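For reference, the kind of debug line we mean in the prolog (the log path here is only an example, since prolog stdout is not shown to the job user):

echo "prolog: job $SLURM_JOB_ID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" >> /var/log/slurm/prolog_debug.log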
So how does Slurm cooperate with the NVIDIA tools to make a job see only
its allocated GPU card(s)? What are the requirements on the NVIDIA GPU
driver, CUDA toolkit, or any other component for Slurm to correctly
restrict GPU usage?