Hi, most likely you want to set it the exact opposite way. The Slurm cloud 
scheduling guide says:

"TreeWidth Since the slurmd daemons are not aware of the network addresses of 
other nodes in the cloud, the slurmd daemons on each node should be sent 
messages directly and not forward those messages between each other. To do so, 
configure TreeWidth to a number at least as large as the maximum node count. 
The value may not exceed 65533."

source: https://slurm.schedmd.com/elastic_computing.html 
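
For example, a minimal slurm.conf fragment (a sketch only; the exact value just 
needs to be at least as large as your maximum possible node count, with 65533 
being the allowed maximum):

    # slurm.conf on the controller and all compute nodes
    # With TreeWidth >= max node count, slurmctld contacts every slurmd
    # directly instead of fanning messages out through other compute nodes.
    TreeWidth=65533

After changing it, an "scontrol reconfigure" (or restarting the daemons) should 
pick up the new value.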

Cheers 

Josef 


________________________________
From: Russell Jones <arjone...@gmail.com>
Sent: Wednesday, 21 July 2021 22:30
To: Slurm User Community List
Subject: [slurm-users] 2 nodes being randomly set to "not responding"

Hi all,

We have a single Slurm cluster with multiple different architectures and 
compute clusters talking to a single slurmctld. This slurmctld is dual-homed on 
two different networks. We have two nodes that are by themselves on 
"network 2", while all of the other nodes are on "network 1". They will stay 
online for a short period of time, but then be marked as down and not 
responding by slurmctld. 10 to 20 minutes later they will be back online, rinse 
and repeat. There are absolutely no firewalls involved anywhere in the network.

I found a mailing list post back in 2018 where someone mentioned that the 
slurmds all expect to be able to talk to each other, and when you have some 
nodes segmented off from others you can get this flapping behavior. He 
suggested setting TreeWidth to 1 to force the slurmds to communicate only 
directly with slurmctld. I gave that a shot, and it unfortunately seemed to 
make all of the other nodes unreachable! :-)

Is there a way of properly configuring our setup so that we can have a 
dual-homed slurmctld without requiring every node to be reachable by every 
other node?

