On Mon, Sep 24, 2018 at 12:27 PM Will Dennis <wden...@nec-labs.com> wrote:
>
> Hi all,
>
> We want to add in some Gres resource types pertaining to GPUs (amount of GPU
> memory and CUDA cores) on some of our nodes. So we added the following params
> into the 'gres.conf' on the nodes that have GPUs:
>
> Name=gpu_mem Count=<#>G
> Name=gpu_cores Count=<#>

I just have a single gres.conf that's copied to all nodes, same as slurm.conf.
It lists NodeName=x Count=y Name=w for each node & gres.
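For example (the node names and counts below are purely made up; adjust them
to your hardware), a single shared gres.conf along those lines might look
like:

  NodeName=gpunode01 Name=gpu File=/dev/nvidia[0-1]
  NodeName=gpunode01 Name=gpu_mem Count=12G
  NodeName=gpunode01 Name=gpu_cores Count=7168
  NodeName=gpunode02 Name=gpu File=/dev/nvidia0
  NodeName=gpunode02 Name=gpu_mem Count=6G
  NodeName=gpunode02 Name=gpu_cores Count=3584

Every node reads the same file and just picks out the lines matching its own
NodeName.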
> And in slurm.conf:
>
> GresTypes=gpu,gpu_mem,gpu_cores
>
> And down in the NodeName lines for these servers:
>
> Gres=gpu:<#>,gpu_mem:no_consume:<#>G,gpu_cores:no_consume:<#>

I'm not using the :no_consume syntax, simply Gres=name:#,y:z,...

Of course, after any changes, copy gres.conf & slurm.conf out to all nodes;
then "scontrol reconfigure" works great for me.

> (where <#> of course is the relevant numerical value)
>
> However, upon restarting the slurmctld on the controller, and the slurmd on
> the clients, the nodes appear to be unhappy with this, giving a message such
> as:
>
> Reason=gres/gpu_mem count too low (0 < 4294967296) [root@2018-09-24T11:36:01]
>
> And of course are then going into DRAIN mode.
>
> We are running Slurm v16.04.5, is doing something like the above a
> possibility on this version? If so, what could be the problem?
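Putting the two files together, here's a minimal sketch of the sort of setup
I run (again with made-up node names and numbers, and without :no_consume):

  # slurm.conf
  GresTypes=gpu,gpu_mem,gpu_cores
  NodeName=gpunode01 Gres=gpu:2,gpu_mem:12G,gpu_cores:7168
  NodeName=gpunode02 Gres=gpu:1,gpu_mem:6G,gpu_cores:3584

with the matching gres.conf shown earlier. After editing, push both files out
to every node and run:

  scontrol reconfigure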