Bumping this thread: the issue persists even after upgrading to 19.05.4.
Does anyone running an HA setup have any insight?
From: Dave Sizer
Date: Thursday, December 19, 2019 at 9:44 AM
To: Slurm User Community List, Brian Andrus
Subject: Re: [slurm-users] Issues with HA
The partitions still come up with AllocNodes=none after the takeover, and this happens even when
swapping the primary/backup roles of the nodes. I am digging through the source
to try and find some hints.
Does anyone have any ideas?
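In case it helps anyone reproduce this, the failover and the resulting partition state can be checked by hand with the standard scontrol commands, roughly like this (nothing here is specific to our setup):

    scontrol takeover                          # force the backup slurmctld to take over
    scontrol show partition | grep AllocNodes  # inspect what the partitions loaded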
From: slurm-users on behalf of Dave Sizer
Reply-To: Slurm User Community List
Date: Tuesday, December 17, 2019 at 1:05 PM
To: Slurm User Community List
It looks like there is some issue with the saving/loading of
partition state during takeover; I'm just a bit stumped on why it is happening
and what to do to stop partitions being loaded with the AllocNodes=none config.
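For reference, the HA-relevant pieces of a slurm.conf like ours look roughly like the lines below (hostnames and the path are placeholders, not our real values). The partition state the backup loads at takeover lives under StateSaveLocation, which has to be on storage both controllers can read and write:

    SlurmctldHost=ctl-primary
    SlurmctldHost=ctl-backup
    StateSaveLocation=/shared/slurm/state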
From: Brian Andrus
Date: Tuesday, December 17, 2019 at 12:30 PM
To: Dave Sizer
Subject: Re: [slurm-users] Issues with HA
Hello friends,
We are running slurm 19.05.1-2 with an HA setup consisting of one primary and
one backup controller. However, we are observing that when the backup takes
over, for some reason AllocNodes is getting set to “none” on all of our
partitions. We can remedy this by manually setting AllocNodes on the affected partitions.
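For reference, the manual reset is along these lines, run once per affected partition (the partition name is a placeholder):

    scontrol update PartitionName=<partition> AllocNodes=ALL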
Hi,
I am debugging slurmd on a worker node with gdb, and I was wondering if there
was a way to disable the socket timeouts between slurmctld and slurmd so that
my jobs don't fail while I'm stepping through code.
Thanks
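(Not a definitive answer, but the timeouts involved should be configurable in slurm.conf; raising them for the duration of a debugging session may be enough, e.g.:

    MessageTimeout=300   # seconds allowed for controller/slurmd round-trip messages
    SlurmdTimeout=0      # 0 stops slurmctld from marking an unresponsive node DOWN

The values above are only illustrative, and the daemons need a reconfigure or restart to pick them up.)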
---