Hi,

You still need to specify the GPUs in the node definition in slurm.conf, even with autodetection. At least the number, and perhaps also the type reported by NVML, must match the node definition (Gres=gpu:geforce_gtx_1080:4).
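
As a rough sketch of what I mean: the node name "gpunode01" below is made up, and the CPU and memory values are simply copied from your slurmd log, so adjust them to your setup. The two files would then look something like this:

```
# slurm.conf (NodeName is a placeholder; hardware values taken from the slurmd log)
GresTypes=gpu
NodeName=gpunode01 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=257840 Gres=gpu:geforce_gtx_1080:4

# gres.conf (unchanged, autodetection fills in device files, cores and links)
AutoDetect=nvml
```

After changing the node definition you will most likely have to restart slurmctld and slurmd; "scontrol show node" should then report the Gres instead of "(null)".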

I think the "_nvml_get_mem_freqs ... Not Supported" error messages can be ignored; the GTX 1080 just does not support this feature.


On 30.11.2021 at 16:12, Benjamin Nacar wrote:
Hi,

We're trying to use Slurm's built-in Nvidia GPU detection mechanism to avoid 
having to specify GPUs explicitly in slurm.conf and gres.conf. We're running 
Debian 11, and the version of Slurm available for Debian 11 is 20.11. However, 
the version of Slurm in the standard debian repositories was apparently not 
compiled on a system with the necessary Nvidia library installed, so we 
recompiled Slurm 20.11 from the Debian source package with no modifications.

With AutoDetect=nvml in gres.conf and GresTypes=gpu in slurm.conf, this is what 
we see on a 4-GPU host after restarting slurmd:

[2021-11-29T15:49:58.226] Node reconfigured socket/core boundaries 
SocketsPerBoard=12:2(hw) CoresPerSocket=1:6(hw)
[2021-11-29T15:50:02.397] error: _nvml_get_mem_freqs: Failed to get supported 
memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported 
memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported 
memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.399] error: _nvml_get_mem_freqs: Failed to get supported 
memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.551] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system 
device(s) detected
[2021-11-29T15:50:02.551] gres/gpu: _normalize_gres_conf: WARNING: The 
following autodetected GPUs are being ignored:
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 
Cores(12):0-11  Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 
Cores(12):0-11  Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 
Cores(12):0-11  Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 
Cores(12):0-11  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-11-29T15:50:02.614] slurmd version 20.11.4 started
[2021-11-29T15:50:02.630] slurmd started on Mon, 29 Nov 2021 15:50:02 -0500
[2021-11-29T15:50:02.630] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 
Memory=257840 TmpDisk=3951 Uptime=975072 CPUSpecList=(null) 
FeaturesAvail=(null) FeaturesActive=(null)

Running "scontrol show node" for this host displays "Gres=(null)", and any attempt to 
submit a job with --gpus=1 results in "srun: error: Unable to allocate resources: Requested node 
configuration is not available".

Any idea what might be wrong?

Thanks,
~~ bnacar


--
Quirin Lohr
Systemadministration
Technische Universität München
Fakultät für Informatik
Lehrstuhl für Bildverarbeitung und Künstliche Intelligenz

Boltzmannstrasse 3
85748 Garching

Tel. +49 89 289 17769
Fax +49 89 289 17757

quirin.l...@in.tum.de
www.vision.in.tum.de
