Confirmed that adding just the "Gres=" bit in slurm.conf works. That's what I get for reading the documentation too fast... thanks all!
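
For anyone landing on this thread later, a minimal sketch of the combination described here, pieced together from the messages in this thread. The node name "gpunode01" is a placeholder (the thread does not give the real hostname); the GPU type and count are the ones nvml reported in the log and that Quirin suggested. Treat it as an illustration under those assumptions, not a verified copy of the final config.

```
# gres.conf (from the original question)
AutoDetect=nvml

# slurm.conf
GresTypes=gpu
# "gpunode01" is a hypothetical node name. Keep whatever CPU/memory
# parameters the node line already has; the only addition is Gres=,
# whose type and count should match what nvml autodetects.
NodeName=gpunode01 Gres=gpu:geforce_gtx_1080:4
```

Once the daemons have been restarted with this in place, "scontrol show node" should report the Gres value rather than "Gres=(null)".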
~~ bnacar

On Wed, 1 Dec 2021 14:05:09 +0100 Quirin Lohr <quirin.l...@in.tum.de> wrote:
> Hi,
>
> you still need to specify the gpus in the node definition in slurm.conf.
> At least the number, perhaps even the type reported by nvml must match
> the node definition. (Gres=gpu:geforce_gtx_1080:4)
>
> I think the error message can be ignored, the 1080 just does not support
> this feature.
>
>
> On 30.11.2021 at 16:12, Benjamin Nacar wrote:
> > Hi,
> >
> > We're trying to use Slurm's built-in Nvidia GPU detection mechanism to
> > avoid having to specify GPUs explicitly in slurm.conf and gres.conf. We're
> > running Debian 11, and the version of Slurm available for Debian 11 is
> > 20.11. However, the version of Slurm in the standard debian repositories
> > was apparently not compiled on a system with the necessary Nvidia library
> > installed, so we recompiled Slurm 20.11 from the Debian source package with
> > no modifications.
> >
> > With AutoDetect=nvml in gres.conf and GresTypes=gpu in slurm.conf, this is
> > what we see on a 4-GPU host after restarting slurmd:
> >
> > [2021-11-29T15:49:58.226] Node reconfigured socket/core boundaries
> > SocketsPerBoard=12:2(hw) CoresPerSocket=1:6(hw)
> > [2021-11-29T15:50:02.397] error: _nvml_get_mem_freqs: Failed to get
> > supported memory frequencies for the GPU : Not Supported
> > [2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get
> > supported memory frequencies for the GPU : Not Supported
> > [2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get
> > supported memory frequencies for the GPU : Not Supported
> > [2021-11-29T15:50:02.399] error: _nvml_get_mem_freqs: Failed to get
> > supported memory frequencies for the GPU : Not Supported
> > [2021-11-29T15:50:02.551] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system
> > device(s) detected
> > [2021-11-29T15:50:02.551] gres/gpu: _normalize_gres_conf: WARNING: The
> > following autodetected GPUs are being ignored:
> > [2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1
> > Cores(12):0-11 Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
> > [2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1
> > Cores(12):0-11 Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
> > [2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1
> > Cores(12):0-11 Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
> > [2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1
> > Cores(12):0-11 Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
> > [2021-11-29T15:50:02.614] slurmd version 20.11.4 started
> > [2021-11-29T15:50:02.630] slurmd started on Mon, 29 Nov 2021 15:50:02 -0500
> > [2021-11-29T15:50:02.630] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1
> > Memory=257840 TmpDisk=3951 Uptime=975072 CPUSpecList=(null)
> > FeaturesAvail=(null) FeaturesActive=(null)
> >
> > Doing an "scontrol show node" for this host displays "Gres=(null)", and any
> > attempts to submit a job with --gpus=1 results in "srun: error: Unable to
> > allocate resources: Requested node configuration is not available".
> >
> > Any idea what might be wrong?
> >
> > Thanks,
> > ~~ bnacar
> >
>
> --
> Quirin Lohr
> Systemadministration
> Technische Universität München
> Fakultät für Informatik
> Lehrstuhl für Bildverarbeitung und Künstliche Intelligenz
>
> Boltzmannstrasse 3
> 85748 Garching
>
> Tel. +49 89 289 17769
> Fax +49 89 289 17757
>
> quirin.l...@in.tum.de
> www.vision.in.tum.de

--
Benjamin Nacar
Systems Programmer
Computer Science Department
Brown University
401.863.7621