Hi all,

Apologies for writing something misleading in the last mail. I missed your error message.
Rob was correct: your slurmd appears to have been built without NVML support. To fix this, install NVML and pass the --with-nvml flag when configuring Slurm if you are compiling it yourself, or find a binary package that was compiled with that flag enabled.

Credit to Rob - WE ARE

S. Zhang

On Thu, Nov 30, 2023 at 23:30, Groner, Rob <rug...@psu.edu> wrote:

> Did you have --with-nvml as part of your configuration? Go back to your
> config.log and verify that it ever said it found nvml.h.
>
> If not, then you'll need to make sure you have the right nvidia/cuda
> packages installed on the host you're building slurm on, and you might have
> to specify --with-nvml=<path to nvml install> if it's not in a standard
> location.
>
> Rob
>
> ------------------------------
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> Ravi Konila <ravibh...@gmail.com>
> *Sent:* Thursday, November 30, 2023 9:06 AM
> *To:* slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
> *Subject:* [slurm-users] Autodetect of nvml is not working in gres.conf
>
> Hello,
>
> My gres.conf has AutoDetect=nvml
> when I restart slurmd service I do get
>
> *fatal: We were configured to autodetect nvml functionality, but we
> weren't able to find that lib when Slurm was configured.*
>
> Referred few links to solve along with slurm-users email archives but
> could not understand much.
>
> Can someone help me with this one. I am using DGX A100 Server which has 4
> numbers of A100 80GB GPUs.
>
> With Warm Regards
> Ravi Konila
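
P.S. For anyone finding this thread later, here is a rough sketch of the config.log check Rob describes. The helper name check_nvml_log is made up for illustration and is not part of Slurm. It also assumes that Slurm's configure uses a standard autoconf header check, which would record its result in the cache variable ac_cv_header_nvml_h. If your config.log looks different, a plain "grep -i nvml config.log" shows the relevant lines.

```shell
#!/bin/sh
# Sketch: check whether a Slurm build found nvml.h at configure time.
# check_nvml_log is a hypothetical helper, not something Slurm ships.
check_nvml_log() {
    log="${1:-config.log}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log"
        return 2
    fi
    # Assumption: a standard autoconf header check was used, which stores
    # its result in config.log as the cache variable ac_cv_header_nvml_h.
    if grep -q '^ac_cv_header_nvml_h=yes' "$log"; then
        echo "nvml.h found"
    else
        echo "nvml.h not found"
        return 1
    fi
}

# Usage, from the Slurm build directory:
#   check_nvml_log config.log
#
# If it reports "nvml.h not found", reconfigure and rebuild. The path is a
# placeholder for wherever NVML is installed on the build host:
#   ./configure --with-nvml=<path to nvml install>
#   make && make install
```

After rebuilding and reinstalling, restart slurmd; the fatal autodetect error should go away once the build actually includes NVML support.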