Confirmed that adding just the "Gres=" bit in slurm.conf works. That's what I 
get for reading the documentation too fast... thanks all!
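
For anyone who finds this thread later, here is a minimal sketch of the two files along the lines Quirin suggested. The node name "gpunode01" is hypothetical, the socket/core values are simply the ones slurmd reported in the log below, and the Gres type string is the one NVML autodetected; adjust all of these for your own hardware.

```
# gres.conf: let slurmd discover the GPU devices through NVML
AutoDetect=nvml

# slurm.conf: the GRES type must be declared, and each GPU node
# still needs an explicit Gres= entry in its node definition
GresTypes=gpu
# "gpunode01" is a placeholder node name (assumption, not from the thread)
NodeName=gpunode01 Gres=gpu:geforce_gtx_1080:4 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1
```

With an entry like this in place, "scontrol show node" should list the GPUs under Gres instead of "Gres=(null)".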

~~ bnacar

On Wed, 1 Dec 2021 14:05:09 +0100
Quirin Lohr <quirin.l...@in.tum.de> wrote:

> Hi,
> 
> you still need to specify the GPUs in the node definition in slurm.conf. 
> At least the number, and perhaps even the type reported by NVML, must match 
> the node definition (e.g. Gres=gpu:geforce_gtx_1080:4).
> 
> I think the error message can be ignored, the 1080 just does not support 
> this feature.
> 
> 
> On 30.11.2021 at 16:12, Benjamin Nacar wrote:
> > Hi,
> > 
> > We're trying to use Slurm's built-in Nvidia GPU detection mechanism to 
> > avoid having to specify GPUs explicitly in slurm.conf and gres.conf. We're 
> > running Debian 11, and the version of Slurm available for Debian 11 is 
> > 20.11. However, the version of Slurm in the standard Debian repositories 
> > was apparently not compiled on a system with the necessary Nvidia library 
> > installed, so we recompiled Slurm 20.11 from the Debian source package with 
> > no modifications.
> > 
> > With AutoDetect=nvml in gres.conf and GresTypes=gpu in slurm.conf, this is 
> > what we see on a 4-GPU host after restarting slurmd:
> > 
> > [2021-11-29T15:49:58.226] Node reconfigured socket/core boundaries 
> > SocketsPerBoard=12:2(hw) CoresPerSocket=1:6(hw)
> > [2021-11-29T15:50:02.397] error: _nvml_get_mem_freqs: Failed to get 
> > supported memory frequencies for the GPU : Not Supported
> > [2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get 
> > supported memory frequencies for the GPU : Not Supported
> > [2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get 
> > supported memory frequencies for the GPU : Not Supported
> > [2021-11-29T15:50:02.399] error: _nvml_get_mem_freqs: Failed to get 
> > supported memory frequencies for the GPU : Not Supported
> > [2021-11-29T15:50:02.551] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system 
> > device(s) detected
> > [2021-11-29T15:50:02.551] gres/gpu: _normalize_gres_conf: WARNING: The 
> > following autodetected GPUs are being ignored:
> > [2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 
> > Cores(12):0-11  Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
> > [2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 
> > Cores(12):0-11  Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
> > [2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 
> > Cores(12):0-11  Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
> > [2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 
> > Cores(12):0-11  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
> > [2021-11-29T15:50:02.614] slurmd version 20.11.4 started
> > [2021-11-29T15:50:02.630] slurmd started on Mon, 29 Nov 2021 15:50:02 -0500
> > [2021-11-29T15:50:02.630] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 
> > Memory=257840 TmpDisk=3951 Uptime=975072 CPUSpecList=(null) 
> > FeaturesAvail=(null) FeaturesActive=(null)
> > 
> > Running "scontrol show node" for this host displays "Gres=(null)", and any 
> > attempt to submit a job with --gpus=1 results in "srun: error: Unable to 
> > allocate resources: Requested node configuration is not available".
> > 
> > Any idea what might be wrong?
> > 
> > Thanks,
> > ~~ bnacar
> > 
> 
> -- 
> Quirin Lohr
> Systemadministration
> Technische Universität München
> Fakultät für Informatik
> Lehrstuhl für Bildverarbeitung und Künstliche Intelligenz
> 
> Boltzmannstrasse 3
> 85748 Garching
> 
> Tel. +49 89 289 17769
> Fax +49 89 289 17757
> 
> quirin.l...@in.tum.de
> www.vision.in.tum.de

-- 
Benjamin Nacar
Systems Programmer
Computer Science Department
Brown University
401.863.7621
