Andy,
I appreciate you making me check again; things do get missed.
SELinux is off and firewalld is disabled:
[root@SRVGRIDSLURM01 ~]# sestatus
SELinux status: disabled
[root@SRVGRIDSLURM01 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)
The one thing I can think of is that the system running slurmctld has two
network interfaces. It serves as a gateway, so it has two network addresses. Two
of the test slurmd nodes are on the other side of that gateway box; one is on
the same box. The two on the other side of the gateway have a different IP
address range and possibly a different mask.
This is from slurm.conf for the three nodes. I know they are talking; I can see
it in the logs when set to a debug logging level.
The NodeName info comes from slurmd -C, so that is correct.
I added the IP addresses, but that made no difference.
# COMPUTE NODES
NodeName=SRVGRIDSLURM01 NodeAddr=192.168.1.60 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821
NodeName=SRVGRIDSLURM02 NodeAddr=192.168.1.61 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821
NodeName=srvgridslurm03 NodeAddr=192.168.1.62 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7821
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
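Since the nodes sit on both sides of the gateway, one way to narrow this down might be a plain TCP reachability check against the NodeAddr values above. The following is only a minimal sketch: it assumes the default Slurm ports (6817 for slurmctld, 6818 for slurmd) are in use, which may not hold if SlurmctldPort or SlurmdPort are set in slurm.conf, and the helper names can_connect and check_nodes are hypothetical.

```python
import socket

# Assumption: default Slurm ports. Adjust if slurm.conf sets
# SlurmctldPort or SlurmdPort to something else.
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818

# NodeName -> NodeAddr, copied from the slurm.conf excerpt above.
NODES = {
    "SRVGRIDSLURM01": "192.168.1.60",
    "SRVGRIDSLURM02": "192.168.1.61",
    "srvgridslurm03": "192.168.1.62",
}


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def check_nodes(nodes, port=SLURMD_PORT, timeout=3.0):
    """Map each node name to whether its address accepts connections on port."""
    return {name: can_connect(addr, port, timeout) for name, addr in nodes.items()}
```

Running check_nodes(NODES) on the control node tests the slurmctld-to-slurmd direction. Running can_connect with the control node's address and SLURMCTLD_PORT on each compute node tests the reverse direction, which matters here because the control node has a different address on each side of the gateway.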
About the only thing I can think of is to make one of the nodes on the
other side of the gateway the control node.
Steve Bland
Technical Product Manager
Third Party Products
Ross Video | Production Technology Experts
T: +1 (613) 228-0688 ext.4219
www.rossvideo.com
________________________________
From: Andy Riebs <[email protected]> on behalf of Andy Riebs
<[email protected]>
Sent: 26 November 2020 13:40
To: Steve Bland <[email protected]>; Slurm User Community List
<[email protected]>
Subject: Re: [EXTERNAL] Re: [slurm-users] trying to diagnose a connectivity
issue between the slurmctld process and the slurmd nodes
One last shot on the firewall front Steve -- does the control node have a
firewall enabled? I've seen cases where that can cause the sporadic messaging
failures that you seem to be seeing.
That failing, I'll defer to anyone with different ideas!
Andy
On 11/26/2020 1:01 PM, Steve Bland wrote: