Re: [slurm-users] Issues with HA config and AllocNodes

2020-01-23 Thread Dave Sizer
Bumping this thread: the issue persists even after upgrading to 19.05.4. Does anyone with an HA setup have any insight? From: Dave Sizer Date: Thursday, December 19, 2019 at 9:44 AM To: Slurm User Community List, Brian Andrus Subject: Re: [slurm-users] Issues with HA

Re: [slurm-users] Issues with HA config and AllocNodes

2019-12-19 Thread Dave Sizer
…and this happens even when swapping the primary/backup roles of the nodes. I am digging through the source to try to find some hints. Does anyone have any ideas? From: slurm-users on behalf of Dave Sizer Reply-To: Slurm User Community List Date: Tuesday, December 17, 2019 at 1:05 PM To
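
For anyone trying to reproduce the swap described above: a controller failover can be forced by hand with standard `scontrol` subcommands, and the partition state inspected right afterwards. A minimal sketch (the partition name `batch` is a placeholder; run the first command on the backup controller):

```
# On the backup controller: force it to assume control.
scontrol takeover

# Confirm which controller is now primary.
scontrol ping

# Check whether AllocNodes survived the takeover
# ("batch" is a placeholder partition name).
scontrol show partition batch | grep -o 'AllocNodes=[^ ]*'
```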

Re: [slurm-users] Issues with HA config and AllocNodes

2019-12-17 Thread Dave Sizer
…some issue with the saving/loading of partition state during takeover; I'm just a bit stumped on why it is happening and how to stop partitions from being loaded with the AllocNodes=none config. From: Brian Andrus Date: Tuesday, December 17, 2019 at 12:30 PM To: Dave Sizer Subject: Re

[slurm-users] Issues with HA config and AllocNodes

2019-12-17 Thread Dave Sizer
Hello friends, We are running Slurm 19.05.1-2 with an HA setup consisting of one primary and one backup controller. However, we are observing that when the backup takes over, AllocNodes is for some reason set to “none” on all of our partitions. We can remedy this by manually setting A…
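
For context, a minimal HA controller setup in 19.05 looks like the slurm.conf fragment below (hostnames and the path are placeholders). Both controllers must see the same StateSaveLocation; a stale or unshared state directory is one plausible way partition state could come back wrong after a takeover:

```
# slurm.conf (fragment) -- hostnames and path are placeholders
SlurmctldHost=ctl-primary              # first entry is the primary controller
SlurmctldHost=ctl-backup               # second entry is the backup
StateSaveLocation=/shared/slurm/state  # must be on storage visible to both controllers
```

The manual remedy for an affected partition is the standard `scontrol update` form, e.g. `scontrol update PartitionName=batch AllocNodes=ALL` (again, `batch` is a placeholder).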

[slurm-users] Disable socket timeouts for debugging

2017-11-08 Thread Dave Sizer
Hi, I am debugging slurmd on a worker node with gdb, and I was wondering if there is a way to disable the socket timeouts between slurmctld and slurmd so that my jobs don't fail while I'm stepping through code. Thanks
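
As far as I know there is no single switch that disables the timeouts outright, but the relevant slurm.conf parameters can be raised to values generous enough for a gdb session. A sketch (values are arbitrary debugging choices, not recommendations for production):

```
# slurm.conf (fragment) -- inflated values for a debugging session only
MessageTimeout=300    # round-trip RPC timeout, default 10 seconds
TCPTimeout=120        # TCP connection timeout, default 2 seconds
SlurmdTimeout=1800    # time before slurmctld marks a non-responding node down, default 300 seconds
```

After editing, push the change with `scontrol reconfigure` (or restart the daemons). Note that very large MessageTimeout values may draw a warning in the slurmctld log.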