Thanks Paul!
On 26-01-2021 20:50, Paul Edmon wrote:
In the RPM spec we use to build Slurm, we do the following additional
things for GPUs:
BuildRequires: cuda-nvml-devel-11-1
Then in the %build section we do:
export CFLAGS="$CFLAGS -L/usr/local/cuda-11.1/targets/x86_64-linux/lib/stubs/ -I/usr/local/cuda-11.1/targets/x86_64-linux/include/"
That ensures the cuda libs are installed and it directs slurm to where
they are. After that configure should detect the nvml libs and link
against them.
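
Before running the build, it may help to confirm that the NVML header and stub library are actually where the CFLAGS above point. The following is only a sketch: the helper name check_nvml_paths and the CUDA_TARGET variable are made up for illustration, and the file names nvml.h and libnvidia-ml.so are what the cuda-nvml-devel package is assumed to install under that path.

```shell
#!/bin/sh
# Assumed location, taken from the CFLAGS above; adjust for your CUDA version.
CUDA_TARGET="${CUDA_TARGET:-/usr/local/cuda-11.1/targets/x86_64-linux}"

# Hypothetical helper: succeeds only if both the NVML header and the
# stub library exist under the given directory.
check_nvml_paths() {
    root="$1"
    status=0
    for f in "$root/include/nvml.h" "$root/lib/stubs/libnvidia-ml.so"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            status=1
        fi
    done
    return "$status"
}

if check_nvml_paths "$CUDA_TARGET"; then
    echo "NVML header and stub library found under $CUDA_TARGET"
else
    echo "NVML files not found; configure will probably not detect NVML"
fi
```

If either file is reported missing, the BuildRequires line or the paths in CFLAGS probably need adjusting for the installed CUDA version.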
I've attached our full spec that we use to build.
What I don't understand is whether it is actually *required* to make the
NVIDIA libraries available to Slurm. I didn't do that, and I'm not aware of
any problems with our GPU nodes so far. Of course, our GPU nodes have
the libraries installed and the /dev/nvidia? devices are present.
Are some of Slurm's GPU features missing or broken without the
libraries? SchedMD's slurm.spec file doesn't mention any "--with
nvidia" (or similar) build options, so I'm really puzzled.
Most of our nodes don't have GPUs, so I wouldn't like to install
libraries on those nodes needlessly.
Thanks,
Ole
On 1/26/2021 2:29 PM, Ole Holm Nielsen wrote:
In another thread, on 26-01-2021 17:44, Prentice Bisbal wrote:
Personally, I think it's good that Slurm RPMs are now available
through EPEL, although I won't be able to use them, and I'm sure many
people on the list won't be able to either, since licensing issues
prevent them from providing support for NVIDIA drivers, so those of
us with GPUs on our clusters will still have to compile Slurm from
source to include NVIDIA GPU support.
We're running Slurm 20.02.6 and recently added some NVIDIA GPU nodes.
The Slurm GPU documentation seems to be
https://slurm.schedmd.com/gres.html
We don't seem to have any problems scheduling jobs on GPUs, even
though our Slurm RPM build host doesn't have any NVIDIA software
installed, as shown by the command:
$ ldconfig -p | grep libnvidia-ml
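
The same check can be wrapped in a small script that prints an explicit result instead of empty output. This is a minimal sketch; the function name has_nvml is invented here, and it only looks at the linker cache, so a library installed outside the cached paths would not be seen.

```shell
#!/bin/sh
# Hypothetical helper: reads a library listing on stdin and succeeds
# if it mentions the NVIDIA Management Library.
has_nvml() {
    grep -q 'libnvidia-ml\.so'
}

# ldconfig -p prints the linker cache; errors are discarded so that a
# missing ldconfig is simply treated as "not found".
if ldconfig -p 2>/dev/null | has_nvml; then
    echo "libnvidia-ml is present in the linker cache"
else
    echo "libnvidia-ml is not present in the linker cache"
fi
```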
I'm curious about Prentice's statement about needing NVIDIA libraries
to be installed when building Slurm RPMs, and I read the discussion in
bug 9525,
https://bugs.schedmd.com/show_bug.cgi?id=9525
from which it seems that the problem was fixed in 20.02.6 and 20.11.
Question: Is there anything special that needs to be done when
building Slurm RPMs with NVIDIA GPU support?
Thanks,
Ole