Yeah, these are frustrating ones to troubleshoot. When I have seen this
in the past, it was usually a missing forward or reverse DNS entry that
caused the problem. You could try turning the verbosity all the way up
and see what you can spot. Otherwise I would recommend opening a ticket
with SchedMD to see if they have any more insight. Then again,
someone on this list might have seen the same issue.
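
For what it is worth, here is a rough sketch of the kind of forward and
reverse check I mean. It is only an illustration, not anything Slurm itself
runs: the function name check_forward_reverse and the idea of passing node
names on the command line are my own assumptions. It uses the system
resolver, so it sees /etc/hosts as well as DNS, and it would need to be run
on the submission host, the compute node, and the master in turn.

```python
import socket
import sys


def check_forward_reverse(name,
                          forward=socket.gethostbyname_ex,
                          reverse=socket.gethostbyaddr):
    """Return a list of problems found resolving `name` in both directions.

    `forward` and `reverse` default to the system resolver; they are
    parameters only so the logic can be exercised without a network.
    """
    problems = []
    try:
        # Forward lookup: name -> list of IPv4 addresses.
        _, _, addresses = forward(name)
    except OSError as exc:
        return [f"forward lookup of {name} failed: {exc}"]

    for addr in addresses:
        try:
            # Reverse lookup: address -> canonical name plus aliases.
            rname, aliases, _ = reverse(addr)
        except OSError as exc:
            problems.append(f"reverse lookup of {addr} failed: {exc}")
            continue
        known = {rname, *aliases}
        short_names = {n.split(".")[0] for n in known}
        # Accept a match on either the full name or the short host name.
        if name not in known and name.split(".")[0] not in short_names:
            problems.append(f"{addr} maps back to {rname}, not {name}")
    return problems


if __name__ == "__main__":
    # Hypothetical usage: python check_dns.py n38 n39
    for node in sys.argv[1:]:
        issues = check_forward_reverse(node)
        print(node, "OK" if not issues else "; ".join(issues))
```

An empty result only says that the host you ran it on resolves the name
consistently; it says nothing about the other hosts involved.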
-Paul Edmon-
On 11/7/18 10:20 AM, Scott Hazelhurst wrote:
Thanks, Paul. Yes, it does seem a likely cause, but I can’t see the problem.
All machines have the same /etc/hosts file, and the worker nodes are just listed
one after another. I’ve checked that the problem nodes are there, with no
obvious difference from the others. I’ve checked that the IP address is correct.
Moreover, I can ping and ssh using either the node name (e.g. n38) or the FQDN.
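
One thing that is easy to miss by eye in a long hosts file is a node name
that appears on two lines with different addresses. A minimal sketch of a
check for that is below; the function name conflicting_names is made up for
illustration, and it assumes the usual "address name [alias ...]" layout with
"#" comments.

```python
from collections import defaultdict


def conflicting_names(hosts_text):
    """Map each host name listed with more than one address to those addresses."""
    seen = defaultdict(set)
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace; skip blank lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            seen[name].add(address)
    return {name: sorted(addrs) for name, addrs in seen.items() if len(addrs) > 1}


if __name__ == "__main__":
    with open("/etc/hosts") as handle:
        for name, addrs in conflicting_names(handle.read()).items():
            print(f"{name} is listed with several addresses: {', '.join(addrs)}")
```

Note that localhost is commonly listed under both 127.0.0.1 and ::1, so it
may be reported even on a healthy machine.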
Scott
On 07 Nov 2018, at 16:57, Paul Edmon <ped...@cfa.harvard.edu> wrote:
This smacks of either the submission host, the destination host, or the master
not being able to resolve the name to an IP. I would triple check that to
ensure that resolution is working.
-Paul Edmon-