On 2/28/2018 6:54 AM, James Keeney wrote:
I did notice one thing in the logs:
2018-02-28 13:21:58,932 [myid:1] - INFO
[/172.31.86.130:3888:QuorumCnxManager$Listener@743] - *Received
connection request /172.31.73.122:34804 <http://172.31.73.122:34804>*
<snip>
When the restarted node attempts to reconnect with the ensemble it
looks like it does so on a random port. Could it be that nodes in the
ensemble are rejecting the new request to rejoin because they are not
listening on that port? And why is it not requesting on 3888:2888?
This is confusing to me.
That appears to be the source port. Which is generally going to be a
very high port and semi-unpredictable.Normal TCP operation.
I have attached a ZK log and a SOLR log. You can watch the whole
progression in the ZK log as it goes from happy to disconnected to
trying to reconnect to part of the ensemble when the other nodes are
restarted. Seems like ZK holds onto a state based on the original
ensemble interactions and that state prevents the node from rejoining
the ensemble. The state is then lost with the restart which allows the
members to re-establish connection and form the new ensemble.
What timestamps correspond to the actions you took?
Lots and lots of connections refused. Unless there's something
preventing network access, I would only expect connections to be refused
if the software isn't running or isn't listening on the destination port.
Which ZK is that log from? The one that you shut down to begin testing,
or one of the others? I see some very large time gaps in the log:
=========
2018-02-26 18:28:18,066 [myid:1] - INFO
[LearnerHandler-/172.31.73.122:57652:LearnerHandler@535] - Received
NEWLEADER-ACK message from 3
2018-02-26 18:56:19,656 [myid:1] - INFO [SyncThread:1:FileTxnLog@203] -
Creating new log file: log.400000001
=========
=========
2018-02-26 18:56:26,286 [myid:1] - WARN
[SendWorker:3:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
2018-02-26 19:34:38,103 [myid:] - INFO [main:QuorumPeerConfig@134] -
Reading configuration from: /opt/zookeeper/current/bin/../conf/zoo.cfg
=========
The first gap is nearly half an hour, the second is more than half an hour.
What happens after the second gap appears to be a program startup. The
things logged at 18:56:nn *might* be program shutdown, but the log
doesn't explicitly say so. If it is a shutdown, then the program was
not running for quite a while.
I would definitely take this problem to the ZK mailing list. The
server-side problems don't involve Solr at all. You are having problems
with Solr, but they are completely within the ZK client code. Likely
both problems have the same root cause, so I'd start with the
server-side issues.
Solr 6.6.2 contains ZK version 3.4.10. Not the latest, but close.
Thanks,
Shawn