[slurm-users] Slurm version 18.08.7 is now available

2019-04-11 Thread Tim Wickberg
We are pleased to announce the availability of Slurm version 18.08.7. This includes over 20 fixes since 18.08.6 was released last month, including one for a regression that caused issues with 'sacct -J' not returning results correctly. Slurm can be downloaded from https://www.schedmd.com/downlo
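
A minimal sketch of the kind of query involved, assuming 'sacct -J' filters on job name and using a hypothetical job name and date:

  # show records since April 11 for jobs named "train_model" (hypothetical name)
  sacct -J train_model --starttime=2019-04-11 --format=JobID,JobName,State,Elapsed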

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Christopher Samuel
On 4/11/19 8:27 AM, Randall Radmer wrote: I guess my next question is, are there any negative repercussions to setting "Delegate=yes" in slurmd.service? This was Slurm bug 5292 and was fixed last year: https://bugs.schedmd.com/show_bug.cgi?id=5292 # Commit cecb39ff087731d2 adds Delegate=yes
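
One quick way to confirm that a running slurmd unit actually has delegation enabled, assuming the stock unit name, without opening the unit file:

  systemctl show slurmd --property=Delegate
  # expected on a unit carrying the fix:
  # Delegate=yes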

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Randall Radmer
I guess my next question is, are there any negative repercussions to setting "Delegate=yes" in slurmd.service? On Thu, Apr 11, 2019 at 8:21 AM Marcus Wagner wrote: > I assume without Delegate=yes this would happen also to regular jobs, > which means, nightly updates could "destroy" the cgroups c

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Marcus Wagner
I assume that without Delegate=yes this would also happen to regular jobs, which means nightly updates could "destroy" the cgroups created by Slurm and therefore let the jobs out "into the wild". Best Marcus P.S.: We had a similar problem with LSF On 4/11/19 3:58 PM, Randall Radmer wrote: Yes, I

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Randall Radmer
Yes, I was just testing that. Adding "Delegate=yes" seems to fix the problem (see below), but wanted to try a few more things before saying anything. [computelab-136:~]$ grep ^Delegate /etc/systemd/system/slurmd.service Delegate=yes [computelab-136:~]$ nvidia-smi --query-gpu=index,name --format=c

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Marcus Wagner
Hi Randall, could you please, as a test, add the following line to the service section of the slurmd.service file (or add an override file): Delegate=yes Best Marcus On 4/11/19 3:11 PM, Randall Radmer wrote: It's now distressingly simple to reproduce this, based on Kilian's clue (off topic
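
A sketch of the override-file variant, assuming a standard systemd drop-in layout; `systemctl edit slurmd` will create the file for you:

  # /etc/systemd/system/slurmd.service.d/override.conf
  [Service]
  Delegate=yes

  # then reload and restart so the change takes effect
  systemctl daemon-reload
  systemctl restart slurmd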

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Randall Radmer
Thanks Luca! I didn't know about these commands. On Thu, Apr 11, 2019 at 1:53 AM Luca Capello wrote: > Hi there, > > On 4/10/19 11:53 PM, Kilian Cavalotti wrote: > > As far as I can tell, it looks like this is probably systemd messing > > up with cgroups and deciding it's the king of cgroups on

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Randall Radmer
It's now distressingly simple to reproduce this, based on Kilian's clue (off topic, "Kilian's Clue" sounds like a good title for a Hardy Boys Mystery Story). After limited testing, it seems to me that running "systemctl daemon-reload" followed by "systemctl restart slurmd" breaks it. See below:
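
A sketch of the reported reproduction, assuming a node with multiple GPUs and device confinement enabled (GPU counts and the exact sequence shown are assumptions, not the poster's verbatim transcript):

  # in a job that was granted a single GPU:
  srun --gres=gpu:1 --pty bash
  nvidia-smi -L          # only the allocated GPU is visible while confinement holds

  # meanwhile, as root on the same node:
  systemctl daemon-reload
  systemctl restart slurmd

  # back in the job shell, without Delegate=yes:
  nvidia-smi -L          # all of the node's GPUs may now be visible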

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Randall Radmer
Thanks Kilian! I'll look at this today. -Randy On Wed, Apr 10, 2019 at 3:59 PM Kilian Cavalotti < kilian.cavalotti.w...@gmail.com> wrote: > Hi Randy! > > > We have a slurm cluster with a number of nodes, some of which have more > than one GPU. Users select how many or which GPUs they want with
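
For context, a sketch of how Slurm's cgroup-based GPU confinement is typically wired up; the GPU count and device paths below are assumptions, not this cluster's actual configuration:

  # slurm.conf
  GresTypes=gpu
  ProctrackType=proctrack/cgroup
  TaskPlugin=task/cgroup
  NodeName=computelab-136 Gres=gpu:2

  # gres.conf (on the node)
  Name=gpu File=/dev/nvidia0
  Name=gpu File=/dev/nvidia1

  # cgroup.conf
  ConstrainDevices=yes

  # user side: request one GPU; only that device should be visible in the step
  srun --gres=gpu:1 nvidia-smi -L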

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Luca Capello
Hi there, On 4/10/19 11:53 PM, Kilian Cavalotti wrote: > As far as I can tell, it looks like this is probably systemd messing > up with cgroups and deciding it's the king of cgroups on the host. FYI, given that I found no mention of those tools, `systemd-cgls` and `systemd-cgtop` help when debuggi
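
A brief sketch of how those tools, plus the raw cgroup filesystem, can be used to inspect what Slurm set up; the uid and job id are hypothetical, and the exact path depends on the cgroup mount and Slurm configuration:

  systemd-cgls     # tree view of the cgroup hierarchy as systemd sees it
  systemd-cgtop    # top-like live view of per-cgroup resource usage

  # the devices cgroup Slurm creates for a job can also be read directly, e.g.:
  cat /sys/fs/cgroup/devices/slurm/uid_1000/job_1234/devices.list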