I am following up on this to first thank everyone for their suggestions and also
let you know that indeed, upgrading from 17.11.0 to 17.11.6 solved the problem.
Our GPUs are now properly walled off via cgroups per our existing config.
Thanks!
Paul.
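For reference, a minimal sketch of the kind of configuration this depends on;
none of these files are shown in the thread, so the node name and values below
are assumptions rather than Paul's actual setup:

    # slurm.conf (excerpt; "gpunode" is a hypothetical node name)
    GresTypes=gpu
    NodeName=gpunode[01-02] Gres=gpu:2 State=UNKNOWN

    # gres.conf on each two-GPU node
    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1

    # cgroup.conf
    ConstrainDevices=yes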
> On May 5, 2018, at 9:04 AM, Chris Samuel wrote:
On Wednesday, 2 May 2018 11:04:34 PM AEST R. Paul Wiegand wrote:
> When I set "--gres=gpu:1", the slurmd log does have encouraging lines such
> as:
>
> [2018-05-02T08:47:04.916] [203.0] debug: Allowing access to device
> /dev/nvidia0 for job
> [2018-05-02T08:47:04.916] [203.0] debug: Not allowi
So there is a patch?
------ Original message ------
From: Fulcomer, Samuel
Date: Wed, May 2, 2018 11:14
To: Slurm User Community List
Subject: Re: [slurm-users] GPU / cgroup challenges
This came up around 12/17, I think, and as I recall the fixes were added to
the src repo then; however, they weren't added to any of the 17.x releases.
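For anyone building from source, one way to check whether the device-constraint
fixes made it into a given release is to ask git; the tag names and file path
below are assumptions based on the SchedMD repository layout:

    # in a clone of https://github.com/SchedMD/slurm
    git log --oneline slurm-17-11-0-1..slurm-17-11-6-1 -- \
        src/plugins/task/cgroup/task_cgroup_devices.c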
On Wed, May 2, 2018 at 6:04 AM, R. Paul Wiegand wrote:
I dug into the logs on both the slurmctld side and the slurmd side.
For the record, I have debug2 set for both and
DebugFlags=CPU_BIND,Gres.
I cannot see much that is terribly relevant in the logs. There's a
known parameter error reported with the memory cgroup specifications,
but I don't think t
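For reference, a sketch of where those knobs live; only debug2 and
DebugFlags=CPU_BIND,Gres are confirmed above, the rest is generic:

    # slurm.conf
    SlurmctldDebug=debug2
    SlurmdDebug=debug2
    DebugFlags=CPU_Bind,Gres

    # or raised on the fly, without editing the config
    scontrol setdebug debug2
    scontrol setdebugflags +gres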
On 02/05/18 10:15, R. Paul Wiegand wrote:
Yes, I am sure they are all the same. Typically, I just scontrol
reconfig; however, I have also tried restarting all daemons.
Understood. Any diagnostics in the slurmd logs when trying to start
a GPU job on the node?
Yes, I am sure they are all the same. Typically, I just scontrol reconfig;
however, I have also tried restarting all daemons.
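Spelled out, the two ways of pushing config changes look roughly like this
(systemd unit names assumed); note that cgroup.conf changes may need a full
slurmd restart rather than just a reconfigure:

    scontrol reconfigure            # re-read slurm.conf on the running daemons
    systemctl restart slurmctld     # on the controller
    systemctl restart slurmd        # on each compute node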
We are moving to 7.4 in a few weeks during our downtime. We had a QDR ->
OFED version constraint -> Lustre client version constraint issue that
delayed our upgrade.
Shou
On 02/05/18 09:31, R. Paul Wiegand wrote:
Slurm 17.11.0 on CentOS 7.1
That's quite old (on both fronts; RHEL 7.1 is from 2015). We started on
that same Slurm release but didn't do the GPU cgroup stuff until a later
version (17.11.3 on RHEL 7.4).
I don't see anything in the NEWS file about rel
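For anyone following along, quick ways to confirm what a node is actually
running, and to search the NEWS file in a source tree (generic commands, not
taken from the thread):

    slurmd -V                        # Slurm version on a compute node
    scontrol show config | grep -i version
    cat /etc/redhat-release          # CentOS/RHEL release
    grep -iE -B1 -A2 'gres|cgroup' NEWS | less   # in the Slurm source tree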
May 1, 2018 at 7:34 PM
To: Slurm User Community List
Subject: Re: [slurm-users] GPU / cgroup challenges
Slurm 17.11.0 on CentOS 7.1
On Tue, May 1, 2018, 19:26 Christopher Samuel wrote:
> On 02/05/18 09:23, R. Paul Wiegand wrote:
>
> > I thought including the /dev/nvidia* would whitelist those devices
> > ... which seems to be the opposite of what I want, no? Or do I
> > misunderstand?
>
> No, I t
Thanks Chris. I do have ConstrainDevices turned on. Are the differences
in your cgroup_allowed_devices_file.conf relevant in this case?
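For context, both of the settings being discussed live in cgroup.conf; a
sketch, where only ConstrainDevices=yes is confirmed by this thread and the
file path is the usual default rather than Paul's:

    # cgroup.conf
    ConstrainDevices=yes
    AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf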
On Tue, May 1, 2018, 19:23 Christopher Samuel wrote:
> On 02/05/18 09:00, Kevin Manalo wrote:
>
> > Also, I recall appending this to the bottom of
On 02/05/18 09:23, R. Paul Wiegand wrote:
I thought including the /dev/nvidia* would whitelist those devices
... which seems to be the opposite of what I want, no? Or do I
misunderstand?
No, I think you're right there, we don't have them listed and cgroups
constrains it correctly (nvidia-smi
Thanks Kevin!
Indeed, nvidia-smi in an interactive job tells me that I can get access to
the device when I should not be able to.
I thought including the /dev/nvidia* would whitelist those devices ...
which seems to be the opposite of what I want, no? Or do I misunderstand?
Thanks,
Paul
On Tue
On 02/05/18 09:00, Kevin Manalo wrote:
Also, I recall appending this to the bottom of
[cgroup_allowed_devices_file.conf]
..
Same as yours
...
/dev/nvidia*
There was a Slurm bug-tracker issue that made this clear; it's not spelled out
so much in the website docs.
That shouldn't be necessary, all we have for this is.
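For reference, the stock example of cgroup_allowed_devices_file.conf in the
Slurm cgroup documentation looks roughly like the following (quoted from
memory, so treat it as an approximation), with /dev/nvidia* appended as Kevin
describes; whether that last line is needed at all is the point in question:

    /dev/null
    /dev/urandom
    /dev/zero
    /dev/sda*
    /dev/cpu/*/*
    /dev/pts/*
    /dev/nvidia*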
Paul,
Having recently set this up, this was my test: when you make a single-GPU
request from inside an interactive run (salloc ... --gres=gpu:1 srun --pty
bash), you should only see the GPU assigned to you via 'nvidia-smi'.
When gres is unset you should see:
nvidia-smi
No devices were found
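Spelled out as commands, that test looks roughly like this; partition, account
and time options are omitted and will be site-specific:

    # with a GPU in the allocation
    salloc --gres=gpu:1 srun --pty bash
    nvidia-smi -L      # expect exactly one GPU listed

    # with no GPU in the allocation
    salloc srun --pty bash
    nvidia-smi         # expect "No devices were found" if devices are constrained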
Greetings,
I am setting up our new GPU cluster, and I seem to have a problem
configuring things so that the devices are properly walled off via
cgroups. Our nodes each have two GPUs; however, if --gres is unset, or
set to --gres=gpu:0, I can access both GPUs from inside a job.
Moreover, if I ask fo
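A quick reproduction of the symptom being described, assuming a two-GPU node
and an otherwise working GRES configuration:

    # no GPU requested, yet both devices are still visible from inside the job
    srun --gres=gpu:0 nvidia-smi -L
    # with ConstrainDevices working as intended, this should instead report
    # "No devices were found"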