Thanks Paul!

On 26-01-2021 20:50, Paul Edmon wrote:
In the RPM spec we use to build Slurm, we do the following additional things for GPUs:

BuildRequires: cuda-nvml-devel-11-1

Then in the %build section we do:

export CFLAGS="$CFLAGS -L/usr/local/cuda-11.1/targets/x86_64-linux/lib/stubs/ -I/usr/local/cuda-11.1/targets/x86_64-linux/include/"

That ensures the CUDA libraries are installed and tells the Slurm build where to find them. After that, configure should detect the NVML libraries and link against them.
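
One way to check afterwards whether the build actually picked up NVML is to look for Slurm's NVML GPU plugin among the installed plugins. The sketch below is only an illustration: the helper name has_nvml_plugin is made up for this example, and the default plugin directory /usr/lib64/slurm is an assumption that may not match your install prefix.

```shell
#!/bin/sh
# Hypothetical helper: report whether a Slurm plugin directory contains
# the NVML GPU plugin (gpu_nvml.so).
# The default directory is an assumption; pass your own as the first argument.
has_nvml_plugin() {
    plugin_dir="${1:-/usr/lib64/slurm}"
    if [ -f "$plugin_dir/gpu_nvml.so" ]; then
        echo "gpu_nvml.so found in $plugin_dir"
        return 0
    fi
    echo "gpu_nvml.so not found in $plugin_dir"
    return 1
}

# Check the directory given on the command line, or the assumed default.
has_nvml_plugin "$@" || true
```

If the plugin is missing, the build most likely did not find the NVML headers and stub library at configure time.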

I've attached our full spec that we use to build.

What I don't understand is, is it actually *required* to make the NVIDIA libraries available to Slurm? I didn't do that, and I'm not aware of any problems with our GPU nodes so far. Of course, our GPU nodes have the libraries installed and the /dev/nvidia? devices are present.

Are some of Slurm's GPU features missing or broken without the libraries? SchedMD's slurm.spec file doesn't mention any "--with nvidia" (or similar) build options, so I'm really puzzled.

Most of our nodes don't have GPUs, so I wouldn't like to install libraries on those nodes needlessly.

Thanks,
Ole

On 1/26/2021 2:29 PM, Ole Holm Nielsen wrote:
In another thread, On 26-01-2021 17:44, Prentice Bisbal wrote:
Personally, I think it's good that Slurm RPMs are now available through EPEL, although I won't be able to use them, and I'm sure many people on the list won't be able to either, since licensing issues prevent them from providing support for NVIDIA drivers, so those of us with GPUs on our clusters will still have to compile Slurm from source to include NVIDIA GPU support.

We're running Slurm 20.02.6 and recently added some NVIDIA GPU nodes.
The Slurm GPU documentation seems to be
https://slurm.schedmd.com/gres.html
We don't seem to have any problems scheduling jobs on GPUs, even though our Slurm RPM build host doesn't have any NVIDIA software installed, as shown by the command:
$ ldconfig -p | grep libnvidia-ml
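
The same check can be wrapped in a small script so that an empty result is reported explicitly. This is only a sketch: the helper name nvml_listed is hypothetical, and it assumes that the library, if installed, is registered in the linker cache.

```shell
#!/bin/sh
# Hypothetical helper: reads `ldconfig -p` output on stdin and succeeds
# if an NVML shared library (libnvidia-ml) is listed.
nvml_listed() {
    grep -q 'libnvidia-ml'
}

# Only query the linker cache if ldconfig is on PATH (it often lives in /sbin).
if command -v ldconfig >/dev/null 2>&1 && ldconfig -p | nvml_listed; then
    echo "libnvidia-ml found via ldconfig"
else
    echo "libnvidia-ml not found via ldconfig"
fi
```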

I'm curious about Prentice's statement about needing NVIDIA libraries to be installed when building Slurm RPMs, and I read the discussion in bug 9525,
https://bugs.schedmd.com/show_bug.cgi?id=9525
from which it seems that the problem was fixed in 20.02.6 and 20.11.

Question: Is there anything special that needs to be done when building Slurm RPMs with NVIDIA GPU support?

Thanks,
Ole
