After a complete shutdown and restart of all daemons, things have changed somewhat.
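
(By "complete shutdown and restart" I mean stopping and starting slurmctld on the controller and slurmd on the GPU nodes; roughly the following, assuming the usual systemd unit names; adjust if the daemons are started differently here:)

systemctl restart slurmctld    # on the controller
systemctl restart slurmd       # on each GPU node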

# scontrol show nodes | egrep '(^Node|Gres)'
NodeName=mlscgpu1 Arch=x86_64 CoresPerSocket=16
   Gres=gpu:quadro_rtx_6000:10(S:0)
NodeName=mlscgpu2 Arch=x86_64 CoresPerSocket=16
   Gres=gpu:quadro_rtx_6000:5(S:0)

and I can submit like this:

mlscgpu1[0]:~$ salloc -n1 -c3 -p batch --gres=gpu:quadro_rtx_6000:1 --mem=1G
salloc: Granted job allocation 16
mlscgpu1[0]:~$ printenv | grep -i CUDA
mlscgpu1[0]:~$ printenv | grep -i slurm
SLURM_NODELIST=mlscgpu1
SLURM_JOB_NAME=bash
SLURM_NODE_ALIASES=(null)
SLURM_JOB_QOS=normal
SLURM_NNODES=1
SLURM_JOBID=16
SLURM_NTASKS=1
SLURM_TASKS_PER_NODE=1
SLURM_CONF=/etc/slurm/slurm.conf
SLURM_CPUS_PER_TASK=3
SLURM_JOB_ID=16
SLURM_SUBMIT_DIR=/autofs/homes/011/raines
SLURM_NPROCS=1
SLURM_JOB_NODELIST=mlscgpu1
SLURM_CLUSTER_NAME=mlsc
SLURM_JOB_CPUS_PER_NODE=4
SLURM_SUBMIT_HOST=mlscgpu1
SLURM_JOB_PARTITION=batch
SLURM_JOB_NUM_NODES=1
SLURM_MEM_PER_NODE=1024
mlscgpu1[0]:~$

But CUDA_VISIBLE_DEVICES is still not being set.
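
One thing I still want to rule out (this is an assumption on my part, not something verified on these nodes): the salloc shell itself is not a job step, and Slurm generally only exports CUDA_VISIBLE_DEVICES to steps launched with srun, so it may be worth checking inside a step:

mlscgpu1[0]:~$ srun printenv CUDA_VISIBLE_DEVICES    # launch a step inside the allocation and check its env

If the variable shows up there, the GPU is being allocated and it is only the interactive shell that never gets it.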



On Thu, 23 Jul 2020 10:32am, Paul Raines wrote:


I have two systems in my cluster with GPUs.  Their setup in slurm.conf is

GresTypes=gpu
NodeName=mlscgpu1 Gres=gpu:quadro_rtx_6000:10 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1546557
NodeName=mlscgpu2 Gres=gpu:quadro_rtx_6000:5 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1546557

My gres.conf is simply

AutoDetect=nvml
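
As a quick sanity check of what NVML autodetection produces on a node, this should print the detected GRES and exit (assuming this slurmd build supports the -G flag; I have not confirmed that on 20.02):

# slurmd -G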

When I start slurmd on mlscgpu2, for example, the log shows:

[2020-07-23T10:05:10.619] 5 GPU system device(s) detected
[2020-07-23T10:05:10.619] Gres Name=gpu Type=quadro_rtx_6000 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-31 CoreCnt=64 Links=-1,0,2,0,0
[2020-07-23T10:05:10.619] Gres Name=gpu Type=quadro_rtx_6000 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-31 CoreCnt=64 Links=0,-1,0,0,0
[2020-07-23T10:05:10.619] Gres Name=gpu Type=quadro_rtx_6000 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-31 CoreCnt=64 Links=2,0,-1,0,0
[2020-07-23T10:05:10.619] Gres Name=gpu Type=quadro_rtx_6000 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-31 CoreCnt=64 Links=0,0,0,-1,2
[2020-07-23T10:05:10.619] Gres Name=gpu Type=quadro_rtx_6000 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=0-31 CoreCnt=64 Links=0,0,0,2,-1
[2020-07-23T10:05:10.626] slurmd version 20.02.3 started
[2020-07-23T10:05:10.627] slurmd started on Thu, 23 Jul 2020 10:05:10 -0400
[2020-07-23T10:05:10.627] CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2 Memory=1546557 TmpDisk=215198 Uptime=1723215 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

# scontrol show nodes | egrep '(^Node|Gres)'
NodeName=mlscgpu1 Arch=x86_64 CoresPerSocket=16
  Gres=gpu:10(S:0)
NodeName=mlscgpu2 Arch=x86_64 CoresPerSocket=16
   Gres=gpu:5(S:0)

Note how Gres above does not show "quadro_rtx_6000". Also, what does the (S:0)
mean?

Doing a submit like this fails:

$ salloc -n1 -c3 -p batch --gres=gpu:quadro_rtx_6000:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 9 has been revoked.
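
That failure is consistent with the controller not knowing the type string at all, which matches the untyped Gres=gpu:10(S:0) shown above. A quick way to double-check what GRES the controller has registered per node (assuming sinfo's %G format field):

$ sinfo -N -o "%N %G"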

This works, but no CUDA device is allocated in the environment:

$ salloc -n1 -c3 -p batch --gres=gpu:1
salloc: Granted job allocation 10
$ printenv | grep -i cuda
$
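
To see whether a GPU was actually assigned to the job even though the shell environment is empty, the detailed job view should list the allocated GRES (assuming the -d/--details output includes the per-node GRES detail in this release):

$ scontrol show job -d 10 | grep -i gres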


I have also tried changing gres.conf to the following and doing an 'scontrol reconfigure':

AutoDetect=nvml
Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia0 Cores=0-31
Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia1 Cores=0-31
Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia2 Cores=0-31
Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia3 Cores=0-31
Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia4 Cores=0-31

But this made no difference.


