Re: [slurm-users] GPU / cgroup challenges

2018-05-21 Thread R. Paul Wiegand
I am following up on this to first thank everyone for their suggestions and also to let you know that, indeed, upgrading from 17.11.0 to 17.11.6 solved the problem. Our GPUs are now properly walled off via cgroups per our existing config. Thanks! Paul. > On May 5, 2018, at 9:04 AM, Chris Samuel wrote: …
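
For reference, one quick way to confirm that the controller and the GPU nodes all ended up on the fixed release after an upgrade like this (the pdsh invocation and node names are only illustrative assumptions):

  # On the controller / a login node
  scontrol --version      # should report 17.11.6

  # On the compute nodes (any parallel shell will do; pdsh shown as an example)
  pdsh -w gpu[01-04] 'slurmd -V'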

Re: [slurm-users] GPU / cgroup challenges

2018-05-05 Thread Chris Samuel
On Wednesday, 2 May 2018 11:04:34 PM AEST R. Paul Wiegand wrote: > When I set "--gres=gpu:1", the slurmd log does have encouraging lines such as: > [2018-05-02T08:47:04.916] [203.0] debug: Allowing access to device /dev/nvidia0 for job > [2018-05-02T08:47:04.916] [203.0] debug: Not allowing …
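
Those slurmd debug lines should be mirrored by the devices cgroup that Slurm builds for the job. A sanity check from inside the allocation might look like the sketch below; the cgroup v1 path assumes a default mount and Slurm's usual uid/job hierarchy on CentOS 7, so adjust for your site:

  # Run inside the job; the path layout is an assumption (cgroup v1)
  cat /sys/fs/cgroup/devices/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/devices.list
  # A properly constrained single-GPU job should list the allowed GPU device
  # (e.g. "c 195:0 rwm" for /dev/nvidia0) and omit the denied one.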

Re: [slurm-users] GPU / cgroup challenges

2018-05-02 Thread Wiegand, Paul
So there is a patch? -- Original message -- From: Fulcomer, Samuel Date: Wed, May 2, 2018 11:14 To: Slurm User Community List Subject: Re: [slurm-users] GPU / cgroup challenges This came up around 12/17, I think, and as I recall the fixes were added to the src repo then; however …

Re: [slurm-users] GPU / cgroup challenges

2018-05-02 Thread Fulcomer, Samuel
This came up around 12/17, I think, and as I recall the fixes were added to the src repo then; however, they weren't added to any of the 17.x releases. On Wed, May 2, 2018 at 6:04 AM, R. Paul Wiegand wrote: > I dug into the logs on both the slurmctld side and the slurmd side. > For the record, I have …

Re: [slurm-users] GPU / cgroup challenges

2018-05-02 Thread R. Paul Wiegand
I dug into the logs on both the slurmctld side and the slurmd side. For the record, I have debug2 set for both and DebugFlags=CPU_BIND,Gres. I cannot see much that is terribly relevant in the logs. There's a known parameter error reported with the memory cgroup specifications, but I don't think …
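
For anyone reproducing this, the logging mentioned above corresponds roughly to the following slurm.conf lines (a sketch only; revert once the debugging is done), followed by a reconfigure so the daemons pick them up:

  # slurm.conf -- temporary debugging aids
  SlurmctldDebug=debug2
  SlurmdDebug=debug2
  DebugFlags=CPU_BIND,Gres

  # then
  scontrol reconfig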

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread Christopher Samuel
On 02/05/18 10:15, R. Paul Wiegand wrote: > Yes, I am sure they are all the same. Typically, I just scontrol reconfig; however, I have also tried restarting all daemons. Understood. Any diagnostics in the slurmd logs when trying to start a GPU job on the node? > We are moving to 7.4 in a few weeks …
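
One way to collect those diagnostics is to grep the node's slurmd log for the device-constraint decisions while a small test GPU job starts; the log path below is an assumption, so check SlurmdLogFile in slurm.conf first:

  # On the GPU node, after submitting a test job with --gres=gpu:1
  grep -Ei 'allowing access|not allowing|gres|nvidia' /var/log/slurm/slurmd.log | tail -n 50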

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Yes, I am sure they are all the same. Typically, I just scontrol reconfig; however, I have also tried restarting all daemons. We are moving to 7.4 in a few weeks during our downtime. We had a QDR -> OFED version constraint -> Lustre client version constraint issue that delayed our upgrade. Should …
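
For completeness, the two approaches mentioned above look like this (service names assume the stock systemd units shipped with Slurm):

  # Push the current slurm.conf/gres.conf to the running daemons
  scontrol reconfig

  # Full restart, e.g. if a change does not appear to take effect via reconfig
  systemctl restart slurmctld        # on the controller
  systemctl restart slurmd           # on each compute node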

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread Christopher Samuel
On 02/05/18 09:31, R. Paul Wiegand wrote: > Slurm 17.11.0 on CentOS 7.1 That's quite old (on both fronts; RHEL 7.1 is from 2015). We started on that same Slurm release but didn't do the GPU cgroup stuff until a later version (17.11.3 on RHEL 7.4). I don't see anything in the NEWS file about …

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread Kevin Manalo
May 1, 2018 at 7:34 PM To: Slurm User Community List Subject: Re: [slurm-users] GPU / cgroup challenges Slurm 17.11.0 on CentOS 7.1 On Tue, May 1, 2018, 19:26 Christopher Samuel <ch...@csamuel.org> wrote: On 02/05/18 09:23, R. Paul Wiegand wrote: > I thought including the /dev/ …

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Slurm 17.11.0 on CentOS 7.1. On Tue, May 1, 2018, 19:26 Christopher Samuel wrote: > On 02/05/18 09:23, R. Paul Wiegand wrote: >> I thought including the /dev/nvidia* would whitelist those devices ... which seems to be the opposite of what I want, no? Or do I misunderstand? > No, I think …

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Thanks Chris. I do have ConstrainDevices turned on. Are the differences in your cgroup_allowed_devices_file.conf relevant in this case? On Tue, May 1, 2018, 19:23 Christopher Samuel wrote: > On 02/05/18 09:00, Kevin Manalo wrote: >> Also, I recall appending this to the bottom of …
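
For context, a minimal cgroup.conf along the lines being discussed might look like the sketch below; only ConstrainDevices is confirmed by the thread, and the remaining lines are common companions rather than the poster's actual file:

  # cgroup.conf (sketch)
  CgroupAutomount=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes
  ConstrainDevices=yes
  # Path is the usual default; adjust if your allowed-devices file lives elsewhere
  AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf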

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread Christopher Samuel
On 02/05/18 09:23, R. Paul Wiegand wrote: > I thought including the /dev/nvidia* would whitelist those devices ... which seems to be the opposite of what I want, no? Or do I misunderstand? No, I think you're right there; we don't have them listed and cgroups constrains it correctly (nvidia-smi …

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Thanks Kevin! Indeed, nvidia-smi in an interactive job tells me that I can get access to the device when I should not be able to. I thought including the /dev/nvidia* would whitelist those devices ... which seems to be the opposite of what I want, no? Or do I misunderstand? Thanks, Paul. On Tue …

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread Christopher Samuel
On 02/05/18 09:00, Kevin Manalo wrote: > Also, I recall appending this to the bottom of [cgroup_allowed_devices_file.conf] .. Same as yours ... /dev/nvidia* There was a SLURM bug issue that made this clear, not so much in the website docs. That shouldn't be necessary; all we have for this is: …
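
A typical cgroup_allowed_devices_file.conf, roughly following the example in the Slurm cgroup documentation, is shown below; note that it deliberately does not list /dev/nvidia*, so per-job GPU access is governed entirely by gres.conf and the GRES request:

  /dev/null
  /dev/urandom
  /dev/zero
  /dev/sda*
  /dev/cpu/*/*
  /dev/pts/*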

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread Kevin Manalo
Paul, Having recently set this up, this was my test: when you make a single-GPU request from inside an interactive run (salloc ... --gres=gpu:1 srun --pty bash), you should only see the GPU assigned to you via 'nvidia-smi'. When gres is unset, you should see nvidia-smi report 'No devices were found' …
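
Kevin's test, spelled out as commands (partition, account and time options omitted for brevity):

  # Request one GPU; nvidia-smi should show exactly one device
  salloc --gres=gpu:1 srun --pty bash
  nvidia-smi

  # Request no GPUs; with working device constraints nvidia-smi should
  # report "No devices were found"
  salloc srun --pty bash
  nvidia-smi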

[slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Greetings, I am setting up our new GPU cluster, and I seem to have a problem configuring things so that the devices are properly walled off via cgroups. Our nodes each have two GPUs; however, if --gres is unset, or set to --gres=gpu:0, I can access both GPUs from inside a job. Moreover, if I ask for …
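
For a node with two GPUs, the GRES side of the configuration usually looks something like the sketch below (node name and device paths are illustrative assumptions, not the poster's actual files); the isolation itself then depends on ConstrainDevices=yes in cgroup.conf, which the rest of the thread discusses:

  # slurm.conf
  GresTypes=gpu
  NodeName=gpu01 Gres=gpu:2 ...

  # gres.conf (on the node)
  Name=gpu File=/dev/nvidia0
  Name=gpu File=/dev/nvidia1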