Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Randall Radmer
…"destroy" the cgroups created by slurm > and therefore let the jobs out "into the wild". > > Best > Marcus > > P.S.: > We had a similar problem with LSF > > On 4/11/19 3:58 PM, Randall Radmer wrote: > > Yes, I was just testing that. Adding "Del…

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Randall Radmer
…could you please, for a test, add the following lines to the service part of > the slurmd.service file (or add an override file). > > Delegate=yes > > > Best > Marcus > > > > On 4/11/19 3:11 PM, Randall Radmer wrote: > > It's now distressingly simple to reprod…
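The fix suggested in this message can be sketched as a systemd drop-in (the override path below is the standard drop-in location, an assumption not quoted from the thread; `Delegate=` is a real systemd directive):

```ini
# /etc/systemd/system/slurmd.service.d/override.conf
# Delegate=yes hands the cgroup subtree to slurmd, so systemd will
# not clean up or rewrite the cgroups slurm creates for its jobs.
[Service]
Delegate=yes
```

Apply it with `systemctl daemon-reload` followed by `systemctl restart slurmd`.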

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Randall Radmer
Thanks Luca! I didn't know about these commands. On Thu, Apr 11, 2019 at 1:53 AM Luca Capello wrote: > Hi there, > > On 4/10/19 11:53 PM, Kilian Cavalotti wrote: > > As far as I can tell, it looks like this is probably systemd messing > > up with cgroups and deciding it's the king of cgroups on…

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Randall Radmer
It's now distressingly simple to reproduce this, based on Kilian's clue (off topic, "Kilian's Clue" sounds like a good title for a Hardy Boys Mystery Story). After limited testing, it seems to me that running "systemctl daemon-reload" followed by "systemctl restart slurmd" breaks it. See below:
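The reproduction described above boils down to this command sequence (a sketch; run on an affected compute node as root, and expect output to vary by site and driver):

```shell
# Before: a job constrained by slurm's device cgroup sees only
# the GPU it was allocated.
srun --gres=gpu:1 nvidia-smi -L

# Make systemd re-evaluate its unit files and restart slurmd.
systemctl daemon-reload
systemctl restart slurmd

# After: the same request may now see every GPU on the node,
# because systemd rewrote the device cgroups slurm had set up.
srun --gres=gpu:1 nvidia-smi -L
```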

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Randall Radmer
Thanks Kilian! I'll look at this today. -Randy On Wed, Apr 10, 2019 at 3:59 PM Kilian Cavalotti < kilian.cavalotti.w...@gmail.com> wrote: > Hi Randy! > > > We have a slurm cluster with a number of nodes, some of which have more > than one GPU. Users select how many or which GPUs they want with…

[slurm-users] How does cgroups limit user access to GPUs?

2019-04-10 Thread Randall Radmer
We have a slurm cluster with a number of nodes, some of which have more than one GPU. Users select how many or which GPUs they want with srun's "--gres" option. Nothing fancy here, and in general this works as expected. But starting a few days ago we've had problems on one machine. A specific us…
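For context, the GPU selection described above looks like this (illustrative; the type name `t4` is an assumption, since type names come from the cluster's own gres.conf):

```shell
# Request any two GPUs on a node
srun --gres=gpu:2 nvidia-smi -L

# Request one GPU of a specific GRES type
srun --gres=gpu:t4:1 nvidia-smi -L
```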

Re: [slurm-users] Backfill isn’t working for a node with two GPUs that have different GRES types.

2019-04-03 Thread Randall Radmer
…on combination > or something like that. > My first suspicion was my submission script since I changed it recently, > but I could not find any error. scontrol reconfig did not help. > But everything went well again, after I restarted the slurmctld. > > Might be worth a try. >

Re: [slurm-users] Backfill isn’t working for a node with two GPUs that have different GRES types.

2019-04-02 Thread Randall Radmer
…/home/rradmer Power= On Mon, Apr 1, 2019 at 11:24 PM Marcus Wagner wrote: > Dear Randall, > > could you please also provide > > > scontrol -d show node computelab-134 > scontrol -d show job 100091 > scontrol -d show job 100094 > > > Best > Marcus > >

[slurm-users] Backfill isn’t working for a node with two GPUs that have different GRES types.

2019-04-01 Thread Randall Radmer
I can’t get backfill to work for a machine with two GPUs (one is a P4 and the other a T4). Submitting jobs works as expected: if the GPU I request is free, then my job runs, otherwise it goes into a pending state. But if I have pending jobs for one GPU ahead of pending jobs for the other GPU, I s…
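A node with one P4 and one T4 would normally carry typed GRES definitions along these lines (a sketch; the node name and device paths are assumptions, not taken from the thread):

```
# gres.conf on the node
Name=gpu Type=p4 File=/dev/nvidia0
Name=gpu Type=t4 File=/dev/nvidia1

# slurm.conf
NodeName=node01 Gres=gpu:p4:1,gpu:t4:1
```

With typed GRES like this, a job asking for --gres=gpu:t4:1 should be able to backfill past a job pending on gpu:p4:1.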

Re: [slurm-users] Using GRES to manage GPUs, but unable to assign specific CPUs to specific GPUs

2018-09-18 Thread Randall Radmer
…2KB) + Core L#5 > PU L#10 (P#5) > PU L#11 (P#45) > Slurm uses the logical cores so 10 and 11 gives you "physical" cores 5 and > 45. > > Julie > > > > ---------- > *From:* slurm-users on behalf of > Randall Radmer …
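Julie's point can be illustrated with a tiny sketch. The topology below is hypothetical (two sockets of 20 cores with 2-way SMT, so hardware-thread siblings are offset by the total core count of 40), but it matches the hwloc excerpt quoted above, where Core L#5 owns PU P#5 and PU P#45:

```python
# Hypothetical SMT-2 layout matching the quoted hwloc output:
# sibling hardware threads are offset by the total core count (40).
def pus_for_logical_core(core: int, total_cores: int = 40) -> list[int]:
    """Return the physical PU ids belonging to one logical core.

    Slurm's Cores= entries in gres.conf use logical core numbering,
    so logical core 5 covers "physical" cores 5 and 45 on this box.
    """
    return [core, core + total_cores]

print(pus_for_logical_core(5))  # [5, 45]
```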

[slurm-users] Using GRES to manage GPUs, but unable to assign specific CPUs to specific GPUs

2018-09-12 Thread Randall Radmer
I’m using GRES to manage eight GPUs in a node on a new Slurm cluster and am trying to bind specific CPUs to specific GPUs, but it’s not working as I expected. I am able to request a specific number of GPUs, but the CPU assignment seems wrong. I assume I’m missing something obvious, but just can't…
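The intended CPU-to-GPU affinity is normally expressed with Cores= in gres.conf (a sketch for an eight-GPU, two-socket node; the core ranges and device paths are assumptions for illustration):

```
# gres.conf: bind each GPU to the cores on its local socket.
# Note: Cores= uses Slurm's logical core numbering, which may not
# match the OS's physical PU numbering (see the hwloc discussion).
Name=gpu File=/dev/nvidia[0-3] Cores=0-19
Name=gpu File=/dev/nvidia[4-7] Cores=20-39
```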