I am following up to thank everyone for their suggestions and to let you know
that upgrading from 17.11.0 to 17.11.6 did indeed solve the problem.
Our GPUs are now properly walled off via cgroups under our existing config.
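For anyone who finds this thread later, the relevant pieces of a setup like ours look roughly as follows. This is an illustrative sketch, not our exact files; device paths and GPU counts will differ per node.

    # cgroup.conf -- enable device confinement
    ConstrainDevices=yes

    # gres.conf (on a node with two GPUs)
    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1

    # slurm.conf -- cgroup task plugin plus the GRES definition
    TaskPlugin=task/cgroup
    GresTypes=gpu

A quick sanity check once that is in place: "srun --gres=gpu:1 nvidia-smi -L" should list only the single GPU allocated to the job, even if CUDA_VISIBLE_DEVICES is unset inside the job.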

Thanks!

Paul.


> On May 5, 2018, at 9:04 AM, Chris Samuel <ch...@csamuel.org> wrote:
> 
> On Wednesday, 2 May 2018 11:04:34 PM AEST R. Paul Wiegand wrote:
> 
>> When I set "--gres=gpu:1", the slurmd log does have encouraging lines such
>> as:
>> 
>> [2018-05-02T08:47:04.916] [203.0] debug:  Allowing access to device
>> /dev/nvidia0 for job
>> [2018-05-02T08:47:04.916] [203.0] debug:  Not allowing access to
>> device /dev/nvidia1 for job
>> 
>> However, I can still "see" both devices from nvidia-smi, and I can
>> still access both if I manually unset CUDA_VISIBLE_DEVICES.
> 
> The only thing I can think of is a bug that's been fixed since 17.11.0 (as I 
> know it works for us with 17.11.5) or a kernel bug (or missing device 
> cgroups).
> 
> Sorry I can't be more helpful!
> 
> All the best,
> Chris
> -- 
> Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
> 
> 

