Hi everyone,

I have a Slurm node, mk-gpu-1, with eight GPUs that I've been testing by sending GPU-based container jobs to it. For whatever reason, it will only run a single GPU job at a time. All other GPU jobs submitted through Slurm sit in the pending (PD) state with reason "(Resources)".
[ztang@mk-gpu-1 ~]$ squeue
  JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
    523     gpu.q slurm-gp  ztang PD  0:00     1 (Resources)
    522     gpu.q slurm-gp bwong1  R  0:09     1 mk-gpu-1

Anyone know why this would happen? I'll try to provide the relevant portions of my configuration:

*slurm.conf:*
GresTypes=gpu
AccountingStorageTres=gres/gpu
DebugFlags=CPU_Bind,gres
NodeName=mk-gpu-1 NodeAddr=10.10.100.106 RealMemory=500000 Gres=gpu:8 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
PartitionName=gpu.q Nodes=mk-gpu-1,mk-gpu-2,mk-gpu-3 Default=NO MaxTime=INFINITE State=UP

*gres.conf:*
# This line is causing issues in Slurm 19.05
#AutoDetect=nvml
NodeName=mk-gpu-1 Name=gpu File=/dev/nvidia[0-7]

(I commented out AutoDetect=nvml because Slurm will not start properly and will output: "slurmd[28070]: fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured." Could use some help there too if possible.)

*cgroup.conf:*
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes

*submission script:*
#!/bin/bash
#SBATCH -c 2
#SBATCH -o slurm-gpu-job.out
#SBATCH -p gpu.q
#SBATCH -w mk-gpu-1
#SBATCH --gres=gpu:1

srun singularity exec --nv docker://tensorflow/tensorflow:latest-gpu \
    python ./models/tutorials/image/mnist/convolutional.py

Thanks in advance for any ideas,
Benjamin Wong
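P.S. I didn't include the SelectType/SelectTypeParameters lines from slurm.conf above. In case they're relevant: my (possibly wrong) understanding is that the node has to be tracked at the core level for several jobs to share it, and that with the default select/linear a node is handed to one job at a time, which would match what I'm seeing. The snippet below is only a sketch of what I mean, taken from the slurm.conf man page rather than copied from my file, with CR_Core_Memory as just one example value:

# Sketch only: core-level consumable-resource selection so that
# multiple GPU jobs can share mk-gpu-1 (select/cons_tres was added in 19.05)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

And if output would help with diagnosis, I assume something like the following would show what has actually been allocated on mk-gpu-1 while job 522 runs; I'm happy to post it:

# Configured vs. allocated TRES (CPUs, memory, gres/gpu) on the node
scontrol show node mk-gpu-1
# -d adds per-job detail, including which GPU indexes the job was given
scontrol -d show job 522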