;destroy" the cgroups created by slurm
> and therefore let the jobs out "into the wild".
>
> Best
> Marcus
>
> P.S.:
> We had a similar problem with LSF
>
> On 4/11/19 3:58 PM, Randall Radmer wrote:
>
> Yes, I was just testing that. Adding "Delegate=yes" ...
> Could you please, for a test, add the following lines to the service part of
> the slurmd.service file (or add an override file).
>
> Delegate=yes
>
>
> Best
> Marcus
>
>
>
> On 4/11/19 3:11 PM, Randall Radmer wrote:
>
> It's now distressingly simple to reproduce ...
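For anyone wanting to try this, a minimal sketch of the suggested override (assuming systemd's usual drop-in mechanism and the stock slurmd.service unit name, neither of which is confirmed in the thread):

    # create a drop-in, e.g. with "systemctl edit slurmd", which writes
    # /etc/systemd/system/slurmd.service.d/override.conf
    [Service]
    Delegate=yes

    # then reload unit files and restart slurmd so the override takes effect
    systemctl daemon-reload
    systemctl restart slurmd

With Delegate=yes, systemd is supposed to leave the cgroup subtree below the unit to slurmd itself, so a later daemon-reload should no longer tear down the job cgroups.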
Thanks Luca! I didn't know about these commands.
On Thu, Apr 11, 2019 at 1:53 AM Luca Capello wrote:
> Hi there,
>
> On 4/10/19 11:53 PM, Kilian Cavalotti wrote:
> > As far as I can tell, it looks like this is probably systemd messing
> > up with cgroups and deciding it's the king of cgroups on ...
It's now distressingly simple to reproduce this, based on Kilian's clue
(off topic, "Kilian's Clue" sounds like a good title for a Hardy Boys
Mystery Story).
After limited testing, it seems to me that running "systemctl
daemon-reload" followed by "systemctl restart slurmd" breaks it. See
below:
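The output that "See below" refers to is not included here; as a rough sketch of the kind of check involved, assuming a cgroup v1 layout with the default /sys/fs/cgroup/freezer/slurm hierarchy (paths not confirmed in the thread):

    # start a throwaway job step and look at the cgroups slurmd created for it
    srun --gres=gpu:1 sleep 600 &
    ls /sys/fs/cgroup/freezer/slurm/     # expect uid_*/job_* directories
    # now poke systemd the way described above
    systemctl daemon-reload
    systemctl restart slurmd
    ls /sys/fs/cgroup/freezer/slurm/     # check whether the job cgroups survived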
Thanks Kilian! I'll look at this today.
-Randy
On Wed, Apr 10, 2019 at 3:59 PM Kilian Cavalotti <
kilian.cavalotti.w...@gmail.com> wrote:
> Hi Randy!
>
> > We have a slurm cluster with a number of nodes, some of which have more
> > than one GPU. Users select how many or which GPUs they want with ...
We have a slurm cluster with a number of nodes, some of which have more
than one GPU. Users select how many or which GPUs they want with srun's
"--gres" option. Nothing fancy here, and in general this works as
expected. But starting a few days ago we've had problems on one machine.
A specific us...
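For context, the selection mentioned here is the standard GRES request syntax; a couple of illustrative invocations (the GPU type name is made up, not taken from this thread):

    srun --gres=gpu:1 nvidia-smi -L       # any one GPU on the allocated node
    srun --gres=gpu:p4:1 nvidia-smi -L    # one GPU of a specific type, if types are configured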
> ... on combination or something like that.
> My first suspicion was my submission script since I changed it recently,
> but I could not find any error. scontrol reconfig did not help.
> But everything went well again after I restarted slurmctld.
>
> Might be worth a try.
>
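For reference, the two operations mentioned above are roughly the following (the systemd unit name is the usual one, not confirmed here):

    scontrol reconfig              # re-read slurm.conf without restarting the controller
    systemctl restart slurmctld    # full restart of the controller daemon

A full restart rebuilds internal state that a plain reconfig keeps, which is presumably why it helped in this case.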
[truncated scontrol output: ... /home/rradmer ... Power= ...]
On Mon, Apr 1, 2019 at 11:24 PM Marcus Wagner wrote:
> Dear Randall,
>
> could you please also provide
>
>
> scontrol -d show node computelab-134
> scontrol -d show job 100091
> scontrol -d show job 100094
>
>
> Best
> Marcus
>
>
I can’t get backfill to work for a machine with two GPUs (one is a P4 and
the other a T4).
Submitting jobs works as expected: if the GPU I request is free, then my
job runs, otherwise it goes into a pending state. But if I have pending
jobs for one GPU ahead of pending jobs for the other GPU, I s...
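For reference, distinguishing the two cards normally relies on typed GRES; a minimal sketch, where the node name, device files and omitted fields are placeholders rather than values from this thread:

    # gres.conf on the node
    Name=gpu Type=p4 File=/dev/nvidia0
    Name=gpu Type=t4 File=/dev/nvidia1

    # matching slurm.conf node definition (other fields omitted)
    NodeName=gpunode01 Gres=gpu:p4:1,gpu:t4:1 ...

    # jobs then request a specific type
    sbatch --gres=gpu:t4:1 job.sh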
> ...2KB) + Core L#5
> PU L#10 (P#5)
> PU L#11 (P#45)
> Slurm uses the logical cores, so 10 and 11 give you "physical" cores 5 and
> 45.
>
> Julie
>
>
>
> ----------
> *From:* slurm-users on behalf of Randall Radmer ...
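The numbering Julie is quoting comes from hwloc's lstopo; on the node itself the mapping can be double-checked with something like the following (the cpu index is taken from the output quoted above):

    lstopo-no-graphics    # shows Core L#5 with PU L#10 (P#5) and PU L#11 (P#45)
    cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list
    # should print "5,45": the two hardware threads sharing core L#5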
I’m using GRES to manage eight GPUs in a node on a new Slurm cluster and am
trying to bind specific CPUs to specific GPUs, but it’s not working as I
expected.
I am able to request a specific number of GPUs, but the CPU assignment
seems wrong.
I assume I’m missing something obvious, but just can't ...
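For what it's worth, that binding is normally expressed per device in gres.conf with Cores= (CPUs= in older releases); a sketch with made-up device files and core ranges:

    # gres.conf sketch; core indices follow the logical numbering Slurm uses
    # (see Julie's note above), and the ranges here are illustrative only
    Name=gpu File=/dev/nvidia0 Cores=0-4
    Name=gpu File=/dev/nvidia1 Cores=5-9
    # ... six more Name=gpu lines for the remaining devices

    # on the job side, enforce the binding instead of treating it as a hint
    srun --gres=gpu:1 --gres-flags=enforce-binding nvidia-smi -L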