On Mon, Sep 24, 2018 at 12:27 PM Will Dennis <wden...@nec-labs.com> wrote:
>
> Hi all,
>
> We want to add in some Gres resource types pertaining to GPUs (amount of GPU 
> memory and CUDA cores) on some of our nodes. So we added the following params 
> into the 'gres.conf' on the nodes that have GPUs:
>
> Name=gpu_mem Count=<#>G
> Name=gpu_cores Count=<#>

I just have a single gres.conf that's copied to all nodes, the same as
slurm.conf. It lists a NodeName=x Name=y Count=z entry for each node and gres.
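
For example (node names and counts below are just illustrative placeholders,
not values from your cluster), a shared gres.conf could look like:

    # gres.conf, identical copy on every node
    NodeName=gpunode[01-02] Name=gpu       Count=2
    NodeName=gpunode[01-02] Name=gpu_mem   Count=8G
    NodeName=gpunode[01-02] Name=gpu_cores Count=3584

slurmd only applies the lines whose NodeName matches the host it runs on, so
one file can describe every node.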

> And in slurm.conf:
>
> GresTypes=gpu,gpu_mem,gpu_cores
>
> And down in the NodeName lines for these servers:
>
> Gres=gpu:<#>,gpu_mem:no_consume:<#>G,gpu_cores:no_consume:<#>

I'm not using the :no_consume syntax, just Gres=<name>:<count>,... on the
NodeName lines. After any change, copy gres.conf and slurm.conf to all nodes
and run 'scontrol reconfigure'; that works well for me.
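
Concretely (hostnames, counts, and paths here are placeholders for whatever
your site actually uses), the slurm.conf entry would be along the lines of:

    NodeName=gpunode[01-02] Gres=gpu:2,gpu_mem:8G,gpu_cores:3584 ...

and after editing, something like:

    # push the updated configs to every node
    for h in gpunode01 gpunode02; do
        scp /etc/slurm/slurm.conf /etc/slurm/gres.conf "$h":/etc/slurm/
    done

    # make slurmctld and every slurmd re-read their configuration
    scontrol reconfigure

    # check that the new gres counts were picked up, and why nodes are drained
    scontrol show node gpunode01 | grep -i Gres
    sinfo -R

If a node was already drained by the earlier gres mismatch, it may also need
'scontrol update NodeName=gpunode01 State=RESUME' once the counts agree.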

> (where <#> of course is the relevant numerical value)
>
> However, upon restarting the slurmctld on the controller, and the slurmd on 
> the clients, the nodes appear to be unhappy with this, giving a message such 
> as:
>
> Reason=gres/gpu_mem count too low (0 < 4294967296) [root@2018-09-24T11:36:01]
>
> And of course are then going into DRAIN mode.
>
> We are running Slurm v16.04.5; is doing something like the above possible on 
> this version? If so, what could be the problem?
>
>
