I also compiled Slurm 20.11.8 with GPU support on AlmaLinux 8.4, but I don't
have any problem with NVML detecting our A100s.
Maybe the NVML library version used to compile Slurm has to match the library
version on the compute node where the GPU is installed?
Also, I see that you're using GeForce GTX cards. Could it be that NVML only
supports Tesla GPUs?
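One way to test that idea outside of Slurm is a tiny NVML program. This is just
a sketch of mine (file name, include path and buffer sizes are my own choices):
it enumerates the GPUs and calls nvmlDeviceGetSupportedMemoryClocks(), which,
as far as I can tell, is the query behind the "_nvml_get_mem_freqs ... Not
Supported" errors in your log:

/*
 * Rough sketch, not from Slurm itself: enumerate GPUs through NVML and try
 * nvmlDeviceGetSupportedMemoryClocks().
 * Build (adjust the include path for your CUDA/driver install):
 *   gcc nvml_check.c -o nvml_check -I/usr/local/cuda/include -lnvidia-ml
 */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlReturn_t rc = nvmlInit();
    if (rc != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed: %s\n", nvmlErrorString(rc));
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    printf("NVML sees %u device(s)\n", count);

    for (unsigned int i = 0; i < count; i++) {
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE] = "";
        unsigned int nclocks = 128;
        unsigned int clocks[128];

        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS)
            continue;
        nvmlDeviceGetName(dev, name, sizeof(name));

        /* Our A100 reports one memory clock; the interesting part is what a
         * GeForce GTX 1080 answers here (NVML_SUCCESS or Not Supported). */
        rc = nvmlDeviceGetSupportedMemoryClocks(dev, &nclocks, clocks);
        printf("GPU %u (%s): supported memory clocks -> %s\n",
               i, name, nvmlErrorString(rc));
    }

    nvmlShutdown();
    return 0;
}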
This is my relevant Slurm configuration:
slurm.conf:
GresTypes=gpu,mps
NodeName=hpc-gpu[3-4].... Gres=gpu:A100:1
gres.conf:
NodeName=hpc-gpu[1-4] AutoDetect=nvml
and the NVIDIA part:
NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5
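On the version-match question: if you want to see which libnvidia-ml.so.1 the
loader actually resolves on a node and what versions it reports, something like
this sketch (again my own, the file names and build line are just assumptions)
can be run on each node and compared against the build host:

/*
 * dlopen() the libnvidia-ml.so.1 that slurmd's gpu/nvml plugin should resolve,
 * print the path the loader picked and the NVML/driver versions it reports.
 * Build: gcc nvml_ver.c -o nvml_ver -ldl
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

typedef int (*init_fn)(void);
typedef int (*ver_fn)(char *, unsigned int);

int main(void)
{
    void *lib = dlopen("libnvidia-ml.so.1", RTLD_NOW);
    if (!lib) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    init_fn nvml_init = (init_fn)dlsym(lib, "nvmlInit_v2");
    ver_fn  nvml_ver  = (ver_fn)dlsym(lib, "nvmlSystemGetNVMLVersion");
    ver_fn  drv_ver   = (ver_fn)dlsym(lib, "nvmlSystemGetDriverVersion");

    /* Which file did the dynamic loader actually pick up? */
    Dl_info info;
    if (nvml_init && dladdr((void *)nvml_init, &info))
        printf("Resolved library: %s\n", info.dli_fname);

    char buf[80];                          /* 80 matches NVML's version buffers */
    if (nvml_init && nvml_init() == 0) {   /* 0 == NVML_SUCCESS */
        if (nvml_ver && nvml_ver(buf, sizeof(buf)) == 0)
            printf("NVML library version: %s\n", buf);
        if (drv_ver && drv_ver(buf, sizeof(buf)) == 0)
            printf("Driver version: %s\n", buf);
    }

    dlclose(lib);
    return 0;
}

The output should line up with the "NVML Library Version" line that slurmd
prints in the log below.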
and this is what I see in the log:
[2021-12-01T09:29:45.675] debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:32
ThreadsPerCore:1
[2021-12-01T09:29:45.675] debug: gres/gpu: init: loaded
[2021-12-01T09:29:45.675] debug: gres/mps: init: loaded
[2021-12-01T09:29:45.676] debug: gpu/nvml: init: init: GPU NVML plugin loaded
[2021-12-01T09:29:46.298] debug2: gpu/nvml: _nvml_init: Successfully
initialized NVML
[2021-12-01T09:29:46.298] debug: gpu/nvml: _get_system_gpu_list_nvml: Systems
Graphics Driver Version: 495.29.05
[2021-12-01T09:29:46.298] debug: gpu/nvml: _get_system_gpu_list_nvml: NVML
Library Version: 11.495.29.05
[2021-12-01T09:29:46.298] debug2: gpu/nvml: _get_system_gpu_list_nvml: Total
CPU count: 64
[2021-12-01T09:29:46.298] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device
count: 1
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU
index 0:
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:
Name: nvidia_a100-pcie-40gb
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:
UUID: GPU-4cbb41e9-296b-ba72-d345-aa41fd7a8842
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI
Domain/Bus/Device: 0:33:0
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI
Bus ID: 00000000:21:00.0
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:
NVLinks: -1
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:
Device File (minor number): /dev/nvidia0
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU
Affinity Range - Machine: 16-23
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: Core
Affinity Range - Abstract: 16-23
[2021-12-01T09:29:46.365] debug2: Possible GPU Memory Frequencies (1):
[2021-12-01T09:29:46.365] debug2: -------------------------------
[2021-12-01T09:29:46.365] debug2: *1215 MHz [0]
[2021-12-01T09:29:46.365] debug2: Possible GPU Graphics Frequencies
(81):
[2021-12-01T09:29:46.365] debug2: ---------------------------------
[2021-12-01T09:29:46.365] debug2: *1410 MHz [0]
[2021-12-01T09:29:46.365] debug2: *1395 MHz [1]
[2021-12-01T09:29:46.365] debug2: ...
[2021-12-01T09:29:46.365] debug2: *810 MHz [40]
[2021-12-01T09:29:46.365] debug2: ...
[2021-12-01T09:29:46.365] debug2: *225 MHz [79]
[2021-12-01T09:29:46.365] debug2: *210 MHz [80]
[2021-12-01T09:29:46.555] debug2: gpu/nvml: _nvml_shutdown: Successfully shut
down NVML
[2021-12-01T09:29:46.555] gpu/nvml: _get_system_gpu_list_nvml: 1 GPU system
device(s) detected
[2021-12-01T09:29:46.555] debug: Gres GPU plugin: Normalizing gres.conf with
system GPUs
[2021-12-01T09:29:46.555] debug2: gres/gpu: _normalize_gres_conf:
gres_list_conf:
[2021-12-01T09:29:46.555] debug2: GRES[gpu] Type:A100 Count:1
Cores(64):(null) Links:(null) Flags:HAS_TYPE File:(null)
[2021-12-01T09:29:46.556] debug: gres/gpu: _normalize_gres_conf: Including the
following GPU matched between system and configuration:
[2021-12-01T09:29:46.556] debug: GRES[gpu] Type:A100 Count:1
Cores(64):16-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug2: gres/gpu: _normalize_gres_conf: gres_list_gpu
[2021-12-01T09:29:46.556] debug2: GRES[gpu] Type:A100 Count:1
Cores(64):16-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug: Gres GPU plugin: Final normalized gres.conf
list:
[2021-12-01T09:29:46.556] debug: GRES[gpu] Type:A100 Count:1
Cores(64):16-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug: Gres MPS plugin: Initalized gres.conf list:
[2021-12-01T09:29:46.556] debug: GRES[gpu] Type:A100 Count:1
Cores(64):16-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug: Gres MPS plugin: Final gres.conf list:
[2021-12-01T09:29:46.556] debug: GRES[gpu] Type:A100 Count:1
Cores(64):16-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] Gres Name=gpu Type=A100 Count=1
Hope it helps.
On 30/11/21 at 16:12, Benjamin Nacar wrote:
Hi,
We're trying to use Slurm's built-in Nvidia GPU detection mechanism to avoid
having to specify GPUs explicitly in slurm.conf and gres.conf. We're running
Debian 11, and the version of Slurm available for Debian 11 is 20.11. However,
the version of Slurm in the standard Debian repositories was apparently not
compiled on a system with the necessary Nvidia library installed, so we
recompiled Slurm 20.11 from the Debian source package with no modifications.
With AutoDetect=nvml in gres.conf and GresTypes=gpu in slurm.conf, this is what
we see on a 4-GPU host after restarting slurmd:
[2021-11-29T15:49:58.226] Node reconfigured socket/core boundaries
SocketsPerBoard=12:2(hw) CoresPerSocket=1:6(hw)
[2021-11-29T15:50:02.397] error: _nvml_get_mem_freqs: Failed to get supported
memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported
memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported
memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.399] error: _nvml_get_mem_freqs: Failed to get supported
memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.551] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system
device(s) detected
[2021-11-29T15:50:02.551] gres/gpu: _normalize_gres_conf: WARNING: The
following autodetected GPUs are being ignored:
[2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1
Cores(12):0-11 Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1
Cores(12):0-11 Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1
Cores(12):0-11 Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1
Cores(12):0-11 Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-11-29T15:50:02.614] slurmd version 20.11.4 started
[2021-11-29T15:50:02.630] slurmd started on Mon, 29 Nov 2021 15:50:02 -0500
[2021-11-29T15:50:02.630] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1
Memory=257840 TmpDisk=3951 Uptime=975072 CPUSpecList=(null)
FeaturesAvail=(null) FeaturesActive=(null)
Doing an "scontrol show node" for this host displays "Gres=(null)", and any attempts to
submit a job with --gpus=1 results in "srun: error: Unable to allocate resources: Requested node
configuration is not available".
Any idea what might be wrong?
Thanks,
~~ bnacar
--
Fernando Guillén Camba (http://citius.usc.es/v/fernando.guillen)
CiTIUS (http://citius.usc.es) · Unidade de Xestión de Infraestruturas TIC
E-mail: fernando.guil...@usc.es · Phone: +34 881816409
Website: citius.usc.es · Twitter: citiususc (http://twitter.com/citiususc)