Hi all,

If you could share a few more details about your OS and Slurm version, that might shed some light.
There is an interesting detail about the NVML package if you are using a RHEL-like OS. The NVML detection part of the Slurm library (/usr/lib64/slurm/gpu_nvml.so) is linked against /lib64/libnvidia-ml.so.1 to do the actual detection. A plain NVIDIA driver installation that pulls in nvidia-driver-NVML from the cuda-rhel8-x86_64 repository installs /lib64/libnvidia-ml.so.1 as a symlink to /lib64/libnvidia-ml.so.<your driver version>. In this setup the linked library is present, so the code does not crash.

However, that package misses another symlink: /lib64/libnvidia-ml.so pointing to /lib64/libnvidia-ml.so.<your driver version>. Take a look at the following line of the Slurm source code (I used the master branch, but git blame says it goes back a long way):

    if (!dlopen("libnvidia-ml.so", RTLD_NOW | RTLD_GLOBAL))

Link to source code: https://github.com/SchedMD/slurm/blob/master/src/interfaces/gpu.c#L100

So even though nvidia-driver-NVML is installed, and the dynamic loader could resolve the linked library because gpu_nvml.so was linked against libnvidia-ml.so.1, the unversioned libnvidia-ml.so name is not provided. The dlopen() therefore fails with "file not found", and the error message you posted follows.

In our case, I just manually created the missing symlink:

    ln -s /lib64/libnvidia-ml.so.1 /lib64/libnvidia-ml.so

and NVML worked as expected. I wonder whether this should be treated as a packaging issue on the NVIDIA side, or filed as a bug in the Slurm code for only checking the .so name without any versioning suffix.

Your case might be different, but since the error message is a direct result of Slurm being unable to find /lib64/libnvidia-ml.so, you should check whether that file exists on your system: if not, install the package, or create the missing symlink.

Sincerely,
S.
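For reference, the check-and-fix above can be sketched as a small shell helper. This is just my own illustration, not part of Slurm or the NVIDIA packages; the function name ensure_nvml_symlink is made up, and it assumes the RHEL-style /lib64 layout (the library lives elsewhere on Debian-based systems):

```shell
#!/bin/sh
# Hypothetical helper: given a library directory, create the unversioned
# libnvidia-ml.so symlink that slurmd's dlopen() call expects, if only
# the versioned .so.1 name is present.
ensure_nvml_symlink() {
    libdir="$1"
    if [ -e "$libdir/libnvidia-ml.so" ]; then
        echo "$libdir/libnvidia-ml.so already present"
    elif [ -e "$libdir/libnvidia-ml.so.1" ]; then
        # Versioned library installed but unversioned name missing:
        # this is the packaging gap described above.
        ln -s "$libdir/libnvidia-ml.so.1" "$libdir/libnvidia-ml.so"
        echo "created $libdir/libnvidia-ml.so"
    else
        echo "no libnvidia-ml.so.1 in $libdir; install the NVIDIA driver/NVML package first" >&2
        return 1
    fi
}

# On a RHEL-like system you would run, as root:
#   ensure_nvml_symlink /lib64
# and then restart slurmd.
```

Note that this only papers over the missing symlink; whether the real fix belongs in the NVIDIA packaging or in Slurm's dlopen() logic is the open question above.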
Zhang

On Thu, Nov 30, 2023 at 23:23, Ravi Konila <ravibh...@gmail.com> wrote:
> Hello,
>
> My gres.conf has AutoDetect=nvml.
> When I restart the slurmd service, I get:
>
> *fatal: We were configured to autodetect nvml functionality, but we
> weren't able to find that lib when Slurm was configured.*
>
> I referred to a few links along with the slurm-users email archives,
> but could not understand much.
>
> Can someone help me with this one? I am using a DGX A100 server which
> has 4 A100 80GB GPUs.
>
> With Warm Regards
> Ravi Konila