That appears to have fixed it. Thank you!

On Wed, Jul 21, 2021 at 3:58 PM <j...@fzu.cz> wrote:
> Hi, you most likely want to set it the exact opposite way, as the Slurm
> cloud scheduling guide says:
>
> "TreeWidth: Since the slurmd daemons are not aware of the network
> addresses of other nodes in the cloud, the slurmd daemons on each node
> should be sent messages directly and not forward those messages between
> each other. To do so, configure TreeWidth to a number at least as large
> as the maximum node count. The value may not exceed 65533."
>
> Source: https://slurm.schedmd.com/elastic_computing.html
>
> Cheers
>
> Josef
>
> Sent from Nine <http://www.9folders.com/>
>
> ------------------------------
> From: Russell Jones <arjone...@gmail.com>
> Sent: Wednesday, 21 July 2021 22:30
> To: Slurm User Community List
> Subject: [slurm-users] 2 nodes being randomly set to "not responding"
>
> Hi all,
>
> We have a single Slurm cluster with multiple architectures and compute
> clusters talking to a single slurmctld. The slurmctld is dual-homed on
> two different networks. Two individual nodes sit by themselves on
> "network 2", while all of the other nodes are on "network 1". The two
> nodes stay online for a short period, then get marked as down and not
> responding by slurmctld. Ten to twenty minutes later they come back
> online, rinse and repeat. There are no firewalls involved anywhere in
> the network.
>
> I found a mailing list post from 2018 where someone mentioned that the
> slurmd daemons all expect to be able to talk to each other, and that
> when some nodes are segmented off from the rest you can get this
> flapping behavior. He suggested setting TreeWidth to 1 to force the
> slurmd daemons to communicate only with slurmctld directly. I gave that
> a shot, and it unfortunately made all of the other nodes unreachable!
> :-)
>
> Is there a way to configure our setup so that we can have a properly
> dual-homed slurmctld without requiring every node to be reachable by
> every other node?
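
For the archives, here is roughly the change that fixed it. The node names
and counts below are placeholders, not our real config; the one line that
matters is TreeWidth, set to at least the maximum node count (capped at
65533) so every slurmd talks to slurmctld directly instead of forwarding
messages through other nodes:

    # slurm.conf (fragment) -- hypothetical layout for illustration.
    # Nodes on two networks, all reporting to the same dual-homed slurmctld.
    NodeName=net1-[001-100] CPUs=32
    NodeName=net2-[01-02]   CPUs=32
    PartitionName=all Nodes=ALL Default=YES

    # Disable slurmd-to-slurmd message forwarding: make TreeWidth at least
    # as large as the node count (102 here); 65533 is the documented maximum.
    TreeWidth=65533

After pushing the file out, you can re-read the config and confirm the
value took effect with:

    scontrol reconfigure
    scontrol show config | grep -i treewidth

(Depending on your Slurm version you may need to restart slurmctld and the
slurmd daemons rather than just reconfigure.)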