Hi,

I'm experiencing a connectivity problem and I'm out of ideas as to why it is happening. I'm running slurmctld on a multihomed host:
(10.9.8.0/8) - master - (10.11.12.0/8)

There is no routing between these two subnets. Until now, all slurmds resided in the first subnet and worked fine. I added some in the second subnet, and they keep changing to the DOWN state.

I checked the "last slurmd control message", and sometimes it is overdue by 20 minutes or more, even though the configured slurmd timeout is 5 minutes. A tcpdump showed that slurmctld isn't even trying to connect to the slurmds during that time. I haven't found any packet loss so far, both redundant DNS servers resolve the host names properly during that time, and slurmctld just reports a communications error for the ping request while the slurmds are running and all hosts are idle.

What could cause slurmctld not to contact the slurmds? Or is it more likely that the reply gets lost on its way?

Gerhard