Hello Everyone,

I'm new here and very new to Slurm, and I'm hoping someone can shed some light
on this for me. I'm in the process of setting up a single-node Slurm
environment with an NVIDIA A100. When I try to start slurmd, I keep getting
this error:

  "We were configured to autodetect nvml functionality, but we weren't able
  to find that lib when Slurm was configured."

When I remove GresTypes=gpu from slurm.conf, slurmd starts up fine and can
queue and run jobs. The CUDA toolkit is installed, along with the NVIDIA
Management Library (NVML). I went as far as removing Slurm and reinstalling
it to see if it would pick up the library. No go.
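
For anyone wanting to reproduce the "NVML is installed" check, here is a
minimal sketch of one way to confirm that the dynamic loader can open the
NVML shared library on the node. The library names libnvidia-ml.so.1 and
libnvidia-ml.so are assumptions and may differ on a given system, and the
function name find_loadable_nvml is made up for illustration. It only tests
the runtime loader, so it says nothing about what Slurm's configure step saw
when the package was built.

```python
import ctypes

# Assumed library names; the exact soname on a given system may differ.
CANDIDATE_NAMES = ("libnvidia-ml.so.1", "libnvidia-ml.so")


def find_loadable_nvml(names=CANDIDATE_NAMES):
    """Return the first name the dynamic loader can open, or None."""
    for name in names:
        try:
            # CDLL raises OSError if the loader cannot find or open the library.
            ctypes.CDLL(name)
        except OSError:
            continue
        return name
    return None


if __name__ == "__main__":
    found = find_loadable_nvml()
    if found:
        print(f"NVML loadable as: {found}")
    else:
        print("NVML library not found by the loader")
```

If this reports the library as loadable while slurmd still logs the error
above, that would suggest the problem lies in how Slurm itself was built
rather than in the NVML install, though that is an inference and not
something confirmed here.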

Setup details:

OS: Ubuntu 20.04
slurm.conf: GresTypes=gpu is added
gres.conf: AutoDetect=nvml Name=gpu Type=a100 File=/dev/nvidia0 COREs=0,1

I’ve searched around and seen that many others have run into this, but I
haven’t found a fix yet. Any help would be greatly appreciated.

Thanks,

Mike

