As a follow-up, we did figure out that if we set the partition to not be exclusive, we get behavior that seems more reasonable.
That is to say, if I use a partition like this:

PartitionName=dlt_shared Nodes=dlt[01-12] Default=NO Shared=YES MaxTime=4-00:00:00 State=UP DefaultTime=8:00:00

with "Shared=YES", then both sbatch and srun produce the expected result: CUDA_VISIBLE_DEVICES is set to the correct value based on what I ask for. It appears I should be switching to OverSubscribe= instead of Shared=, so I will play with that when I can (a rough sketch of what I expect that line to look like is below the quoted message), but I still don't understand why, with "Shared=EXCLUSIVE", srun gives one result and sbatch gives another.

Tim

On Wed, May 19, 2021 at 11:26 AM Tim Carlson <tim.s.carl...@gmail.com> wrote:

> Hey folks,
>
> Here is my setup:
>
> slurm-20.11.4 on x86_64 running CentOS 7.x with CUDA 11.1
>
> The relevant parts of slurm.conf and the gres.conf file for these nodes are:
>
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> PriorityType=priority/multifactor
> GresTypes=gpu
>
> NodeName=dlt[01-12] Gres=gpu:8 Feature=rtx Procs=40 State=UNKNOWN
>
> PartitionName=dlt Nodes=dlt[01-12] Default=NO Shared=Exclusive MaxTime=4-00:00:00 State=UP DefaultTime=8:00:00
>
> And the gres.conf file for those nodes:
>
> [root@dlt02 ~]# more /etc/slurm/gres.conf
> Name=gpu File=/dev/nvidia0
> Name=gpu File=/dev/nvidia1
> Name=gpu File=/dev/nvidia2
> Name=gpu File=/dev/nvidia3
> Name=gpu File=/dev/nvidia4
> Name=gpu File=/dev/nvidia5
> Name=gpu File=/dev/nvidia6
> Name=gpu File=/dev/nvidia7
>
> Now for the weird part. srun works as expected and gives me a single GPU:
>
> [tim@rc-admin01 ~]$ srun -p dlt -N 1 -w dlt02 --gres=gpu:1 -A ops --pty -u /bin/bash
> [tim@dlt02 ~]$ env | grep CUDA
> CUDA_VISIBLE_DEVICES=0
>
> If I submit basically the same thing with sbatch:
>
> [tim@rc-admin01 ~]$ cat sbatch.test
> #!/bin/bash
> #SBATCH -N 1
> #SBATCH -A ops
> #SBATCH -t 10
> #SBATCH -p dlt
> #SBATCH --gres=gpu:1
> #SBATCH -w dlt02
> env | grep CUDA
>
> I get the following output:
>
> [tim@rc-admin01 ~]$ cat slurm-28824.out
> CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
>
> Any ideas of what is going on here?
>
> Thanks in advance! This one has me stumped.
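P.S. For anyone following along, here is a minimal sketch of the partition definitions I plan to try with OverSubscribe= in place of the deprecated Shared= parameter. This is untested on my side; I am assuming OverSubscribe=YES and OverSubscribe=EXCLUSIVE are the direct replacements for Shared=YES and Shared=Exclusive, with every other value carried over from the lines above:

PartitionName=dlt_shared Nodes=dlt[01-12] Default=NO OverSubscribe=YES MaxTime=4-00:00:00 State=UP DefaultTime=8:00:00

PartitionName=dlt Nodes=dlt[01-12] Default=NO OverSubscribe=EXCLUSIVE MaxTime=4-00:00:00 State=UP DefaultTime=8:00:00

I will report back once I have had a chance to confirm whether the sbatch/srun difference persists with these settings.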