I also compiled Slurm 20.11.8 to get GPU support, in my case on AlmaLinux 8.4, and I don't have any problems with NVML detecting our A100s.

Maybe the NVML library version used when compiling Slurm has to match the library version on the compute node where the GPU is?
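As a quick sanity check of that hypothesis (a hypothetical helper, not part of Slurm), you could compare the driver version nvidia-smi reports against the NVML library version that slurmd logs at startup; in my log further down, the NVML version string simply embeds the driver version:

```python
# Hypothetical helper: check that the NVML library version slurmd logs
# (e.g. "11.495.29.05") embeds the driver version nvidia-smi reports
# (e.g. "495.29.05"). A mismatch could point at build/runtime skew.

def nvml_matches_driver(nvml_version: str, driver_version: str) -> bool:
    """Return True if the NVML version string ends with the driver version."""
    return nvml_version.endswith(driver_version)

# Values taken from the slurmd log of my A100 node:
print(nvml_matches_driver("11.495.29.05", "495.29.05"))  # → True
```

On your node you'd feed in the "NVML Library Version" line from slurmd's debug output and the driver version from nvidia-smi.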

Also, I see that you're using GeForce GTX cards. Could it be that NVML only fully supports Tesla-class GPUs?
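One way to probe this on your nodes is to ask the driver for its supported clock tables with `nvidia-smi -q -d SUPPORTED_CLOCKS`, since that is the same information `_nvml_get_mem_freqs` fails to get in your log. A sketch of interpreting that output (the sample line is hypothetical, what I'd expect on a board whose clock tables are not exposed):

```python
# Sketch: decide from `nvidia-smi -q -d SUPPORTED_CLOCKS` output whether
# the driver exposes clock tables for a GPU. Boards that print
# "Supported Clocks : N/A" would match the "Not Supported" errors that
# _nvml_get_mem_freqs logs. The sample output below is hypothetical.

def clocks_supported(nvidia_smi_output: str) -> bool:
    """True unless the SUPPORTED_CLOCKS section reports N/A."""
    for line in nvidia_smi_output.splitlines():
        if line.strip().startswith("Supported Clocks"):
            return not line.rstrip().endswith("N/A")
    return False  # section absent: treat as unsupported

sample = "    Supported Clocks                  : N/A"
print(clocks_supported(sample))  # → False
```

If your GTX 1080s report N/A there, the mem-freqs errors would be a driver limitation rather than a Slurm build problem.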

This is my relevant Slurm configuration:

slurm.conf:

GresTypes=gpu,mps

NodeName=hpc-gpu[3-4].... Gres=gpu:A100:1

gres.conf:

NodeName=hpc-gpu[1-4] AutoDetect=nvml


and the NVIDIA part:

NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5


and this is what I see in the log:

[2021-12-01T09:29:45.675] debug:  CPUs:64 Boards:1 Sockets:2 CoresPerSocket:32 ThreadsPerCore:1
[2021-12-01T09:29:45.675] debug:  gres/gpu: init: loaded
[2021-12-01T09:29:45.675] debug:  gres/mps: init: loaded
[2021-12-01T09:29:45.676] debug:  gpu/nvml: init: init: GPU NVML plugin loaded
[2021-12-01T09:29:46.298] debug2: gpu/nvml: _nvml_init: Successfully initialized NVML
[2021-12-01T09:29:46.298] debug:  gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 495.29.05
[2021-12-01T09:29:46.298] debug:  gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 11.495.29.05
[2021-12-01T09:29:46.298] debug2: gpu/nvml: _get_system_gpu_list_nvml: Total CPU count: 64
[2021-12-01T09:29:46.298] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device count: 1
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 0:
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Name: nvidia_a100-pcie-40gb
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     UUID: GPU-4cbb41e9-296b-ba72-d345-aa41fd7a8842
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Domain/Bus/Device: 0:33:0
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Bus ID: 00000000:21:00.0
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: -1
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Device File (minor number): /dev/nvidia0
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 16-23
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 16-23
[2021-12-01T09:29:46.365] debug2: Possible GPU Memory Frequencies (1):
[2021-12-01T09:29:46.365] debug2: -------------------------------
[2021-12-01T09:29:46.365] debug2:     *1215 MHz [0]
[2021-12-01T09:29:46.365] debug2:         Possible GPU Graphics Frequencies (81):
[2021-12-01T09:29:46.365] debug2:         ---------------------------------
[2021-12-01T09:29:46.365] debug2:           *1410 MHz [0]
[2021-12-01T09:29:46.365] debug2:           *1395 MHz [1]
[2021-12-01T09:29:46.365] debug2:           ...
[2021-12-01T09:29:46.365] debug2:           *810 MHz [40]
[2021-12-01T09:29:46.365] debug2:           ...
[2021-12-01T09:29:46.365] debug2:           *225 MHz [79]
[2021-12-01T09:29:46.365] debug2:           *210 MHz [80]
[2021-12-01T09:29:46.555] debug2: gpu/nvml: _nvml_shutdown: Successfully shut down NVML
[2021-12-01T09:29:46.555] gpu/nvml: _get_system_gpu_list_nvml: 1 GPU system device(s) detected
[2021-12-01T09:29:46.555] debug:  Gres GPU plugin: Normalizing gres.conf with system GPUs
[2021-12-01T09:29:46.555] debug2: gres/gpu: _normalize_gres_conf: gres_list_conf:
[2021-12-01T09:29:46.555] debug2:     GRES[gpu] Type:A100 Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE File:(null)
[2021-12-01T09:29:46.556] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug2: gres/gpu: _normalize_gres_conf: gres_list_gpu
[2021-12-01T09:29:46.556] debug2:     GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug:  Gres GPU plugin: Final normalized gres.conf list:
[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug:  Gres MPS plugin: Initalized gres.conf list:
[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug:  Gres MPS plugin: Final gres.conf list:
[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] Gres Name=gpu Type=A100 Count=1




Hope it helps.


On 30/11/21 at 16:12, Benjamin Nacar wrote:
Hi,

We're trying to use Slurm's built-in NVIDIA GPU detection mechanism to avoid 
having to specify GPUs explicitly in slurm.conf and gres.conf. We're running 
Debian 11, where the available Slurm version is 20.11. However, the package in 
the standard Debian repositories was apparently not compiled on a system with 
the necessary NVIDIA library installed, so we recompiled Slurm 20.11 from the 
Debian source package with no modifications.

With AutoDetect=nvml in gres.conf and GresTypes=gpu in slurm.conf, this is what 
we see on a 4-GPU host after restarting slurmd:

[2021-11-29T15:49:58.226] Node reconfigured socket/core boundaries SocketsPerBoard=12:2(hw) CoresPerSocket=1:6(hw)
[2021-11-29T15:50:02.397] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.399] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.551] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
[2021-11-29T15:50:02.551] gres/gpu: _normalize_gres_conf: WARNING: The following autodetected GPUs are being ignored:
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-11-29T15:50:02.614] slurmd version 20.11.4 started
[2021-11-29T15:50:02.630] slurmd started on Mon, 29 Nov 2021 15:50:02 -0500
[2021-11-29T15:50:02.630] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=257840 TmpDisk=3951 Uptime=975072 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

Doing an "scontrol show node" for this host displays "Gres=(null)", and any attempt to 
submit a job with --gpus=1 results in "srun: error: Unable to allocate resources: 
Requested node configuration is not available".

Any idea what might be wrong?

Thanks,
~~ bnacar

--
Fernando Guillén Camba <http://citius.usc.es/v/fernando.guillen>
CiTIUS <http://citius.usc.es/> · Unidade de Xestión de Infraestruturas TIC
E-mail: fernando.guil...@usc.es · Phone: +34 881816409
Website: citius.usc.es · Twitter: citiususc <http://twitter.com/citiususc>
