On 2/28/2018 6:54 AM, James Keeney wrote:
I did notice one thing in the logs:

2018-02-28 13:21:58,932 [myid:1] - INFO [/172.31.86.130:3888:QuorumCnxManager$Listener@743] - *Received connection request /172.31.73.122:34804 <http://172.31.73.122:34804>*

<snip>

When the restarted node attempts to reconnect with the ensemble it looks like it does so on a random port. Could it be that nodes in the ensemble are rejecting the new request to rejoin because they are not listening on that port? And why is it not requesting on 3888:2888? This is confusing to me.

That appears to be the source port.  Which is generally going to be a very high port and semi-unpredictable.Normal TCP operation.

I have attached a ZK log and a SOLR log. You can watch the whole progression in the ZK log as it goes from happy to disconnected to trying to reconnect to part of the ensemble when the other nodes are restarted. Seems like ZK holds onto a state based on the original ensemble interactions and that state prevents the node from rejoining the ensemble. The state is then lost with the restart which allows the members to re-establish connection and form the new ensemble.

What timestamps correspond to the actions you took?

Lots and lots of connections refused.  Unless there's something preventing network access, I would only expect connections to be refused if the software isn't running or isn't listening on the destination port.

Which ZK is that log from?  The one that you shut down to begin testing, or one of the others?  I see some very large time gaps in the log:

=========
2018-02-26 18:28:18,066 [myid:1] - INFO [LearnerHandler-/172.31.73.122:57652:LearnerHandler@535] - Received NEWLEADER-ACK message from 3 2018-02-26 18:56:19,656 [myid:1] - INFO [SyncThread:1:FileTxnLog@203] - Creating new log file: log.400000001
=========

=========
2018-02-26 18:56:26,286 [myid:1] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@954] - Send worker leaving thread 2018-02-26 19:34:38,103 [myid:] - INFO [main:QuorumPeerConfig@134] - Reading configuration from: /opt/zookeeper/current/bin/../conf/zoo.cfg
=========

The first gap is nearly half an hour, the second is more than half an hour.

What happens after the second gap appears to be a program startup.  The things logged at 18:56:nn *might* be program shutdown, but the log doesn't explicitly say so.  If it is a shutdown, then the program was not running for quite a while.

I would definitely take this problem to the ZK mailing list.  The server-side problems don't involve Solr at all.  You are having problems with Solr, but they are completely within the ZK client code.  Likely both problems have the same root cause, so I'd start with the server-side issues.

Solr 6.6.2 contains ZK version 3.4.10.  Not the latest, but close.

Thanks,
Shawn

Reply via email to