On 1/21/2020 12:32 AM, Chris Samuel wrote:
On 20/1/20 3:00 pm, Dean Schulze wrote:
There's either a problem with the source code I cloned from github,
or there is a problem when the controller runs on Ubuntu 19 and the
node runs on CentOS 7.7. I'm downgrading to a stable 19.05 build to
see if that solves the problem.
I've run the master branch on a Cray XC without issues, and I concur
with what the others have said and suggest it's worth checking the
slurmd and slurmctld logs to find out why communication is not working
between them.
And if the logs do not have enough information, run the daemon in the
foreground with increased verbosity:
slurmd -D -v -v -v
As another said, check whether the connections are available with telnet:
server->client 'telnet node1 6818' (6818 is the default slurmd port), and
the same from compute->server (6817 is the default slurmctld port).
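
If telnet is not installed on a host, the same reachability test can be
sketched in a few lines of Python. This is only a rough equivalent of the
telnet check above, not anything Slurm ships: the function name port_open
and the 3-second timeout are my own choices, and the ports assume the
defaults (6818 for slurmd, 6817 for slurmctld) have not been changed in
slurm.conf.

```python
import socket
import sys


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only shows that something accepts connections on
    the port, not that the listener is actually slurmd or slurmctld.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example: python3 portcheck.py node1 6818   (script name is a placeholder)
    host, port = sys.argv[1], int(sys.argv[2])
    state = "reachable" if port_open(host, port) else "unreachable"
    print(f"{host}:{port} {state}")
```

Run it from the controller against the node on 6818, then from the node
against the controller on 6817. A failure in only one direction points at a
firewall rather than at Slurm itself.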
Are these new host builds? Is there a firewall enabled? Kinda sounds
like a firewall on the client that allows outbound (initial connection
to the slurmctld) but not new inbound (slurmctld ping) connections.
-b