We're in the midst of transitioning our SGE cluster to Slurm 20.02.6, running on up-to-date CentOS 7. We built RPMs from the standard tarball against CUDA 10.1. These RPMs worked just fine on our first GPU test node (with Tesla K80s) using "AutoDetect=nvml" in /etc/gres.conf. However, we just tried to add a second host with GTX 1080s in it. Running "slurmd -G" on that host results in the following output:

slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd:  4 GPU system device(s) detected
slurmd:  WARNING: The following autodetected GPUs are being ignored:
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):28-55
slurmd: Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):28-55
slurmd: Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):0-27
slurmd: Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):0-27
slurmd: Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0

My googling has utterly failed me on this.  Any help?  Thanks!

--
Joshua Baker-LePain
Wynton Cluster Sysadmin
UCSF

