Shawn - First, it's good to know that this is unusual behavior. That actually helps as it lets me know that I should keep digging.
Here are a couple of things that might help. In the configuration I am calling out all three ZK nodes. Here is the configuration of Solr:

-DSTOP.KEY=solrrocks
-DSTOP.PORT=7983
-Dhost=solr2
-Djetty.home=/opt/solr/server
-Djetty.port=8983
-Dlog4j.configuration=file:/data/solr/log4j.properties
-Dsolr.install.dir=/opt/solr
-Dsolr.log.dir=/data/solr/logs
-Dsolr.log.muteconsole
-Dsolr.solr.home=/data/solr/data
-Duser.timezone=UTC
-DzkClientTimeout=15000
-DzkHost=<ZK Host internal IP 1>:2181,<ZK Host internal IP 2>:2181,<ZK Host internal IP 1>:2181
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:+UseGCLogFileRotation
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000
-XX:ConcGCThreads=4
-XX:GCLogFileSize=20M
-XX:MaxTenuringThreshold=8
-XX:NewRatio=3
-XX:NumberOfGCLogFiles=9
-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /data/solr/logs
-XX:ParallelGCThreads=4
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90
-Xloggc:/data/solr/logs/solr_gc.log
-Xms2G
-Xmx6G
-Xss1024k
-Xss256k
-verbose:gc

Here are the types of Solr errors I receive when this happens. I was able to determine that it was not a security problem using telnet to connect to port 2181 on the ZK nodes.

2018-02-26 19:58:50.964 WARN (main-SendThread(<internal IP>:2181)) [ ] o.a.z.ClientCnxn Session 0x361d3ae3f1c0000 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2018-02-26 19:58:52.894 WARN (main-SendThread(<internal IP>:2181)) [ ] o.a.z.ClientCnxn Session 0x361d3ae3f1c0000 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2018-02-26 19:58:53.456 WARN (main-SendThread(<internal IP>:2181)) [ ] o.a.z.ClientCnxn Session 0x361d3ae3f1c0000 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)

And here are the errors when the ZK nodes are not able to connect to each other.

2018-02-26 19:57:25,554 [myid:2] - WARN [WorkerSender[myid=2]:QuorumCnxManager@588] - Cannot open channel to 1 at election address /<internal IP>:3888
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:538)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:452)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:433)
    at java.lang.Thread.run(Thread.java:748)
2018-02-26 19:57:25,554 [myid:2] - INFO [WorkerSender[myid=2]:QuorumPeer$QuorumServer@167] - Resolved hostname: <internal IP> to address: /<internal IP>
2018-02-26 19:57:25,554 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 2 (n.leader), 0xa00000013 (n.zxid), 0x4 (n.round), LOOKING (n.state), 2 (n.sid), 0xa (n.peerEpoch) LOOKING (my state)
2018-02-26 19:57:25,556 [myid:2] - WARN [WorkerSender[myid=2]:QuorumCnxManager@588] - Cannot open channel to 3 at election address /<internal IP>:3888
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:538)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:452)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:433)
    at java.lang.Thread.run(Thread.java:748)
2018-02-26 19:57:25,556 [myid:2] - INFO [WorkerSender[myid=2]:QuorumPeer$QuorumServer@167] - Resolved hostname: <internal IP> to address: /<internal IP>
2018-02-26 19:57:25,756 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@588] - Cannot open channel to 1 at election address /<internal IP>:3888
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:614)
    at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:843)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:913)
2018-02-26 19:57:25,757 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@167] - Resolved hostname: <internal IP> to address: /<internal IP>
2018-02-26 19:57:25,812 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@588] - Cannot open channel to 3 at election address /<internal IP>:3888
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:614)
    at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:843)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:913)

Let me know if you need anything else or if I should take my request over to ZooKeeper.

Thanks.

Jim K.

On Tue, Feb 27, 2018 at 8:19 PM Shawn Heisey <apa...@elyograg.org> wrote:

> On 2/27/2018 10:57 AM, James Keeney wrote:
> > *1 - ZK ensemble not accepting return of node*
> > Currently, when a ZK node in the ensemble goes down the ensemble is able to
> > do what it should do and keeps working. However when I bring the 3rd node
> > back online the other two nodes reject connection requests from the 3rd
> > node until I restart the nodes. The sequence is:
> >
> > 1. Bring 3rd node back online
> > 2. Restart follower in existing ensemble
> > 3. Restart leader in existing ensemble
> >
> > When this is done the third node happily becomes part of the ensemble.
>
> From what I understand, restarting the other nodes should not be
> required. If everything is configured properly, I don't think that
> should be happening, but I don't have deep ZK knowledge.
>
> > *2 - Solr nodes unable to connect*
> > When setting up the cluster for the first time the ensemble rejects the
> > Solr connection requests until the ZK on the ZK ensemble members is
> > restarted.
>
> <snip>
>
> > However, we have also seen that if we have a problem with one of the Solr
> > nodes that requires restarting more than one node we have to restart ZK to
> > reconnect the nodes with the ensemble again.
>
> These problems sound very weird too. I wish I had some idea, but
> without logs showing what kind of errors are encountered, I have no idea
> what's happening.
>
> None of these problems are in Solr code. Solr uses the ZooKeeper client
> code without modification. All the ZK communication is done in ZK code,
> initialized with the zkHost string and a few other config bits (like
> zkClientTimeout) provided to Solr at startup.
>
> If you want to share the Solr log and the ZK server logs covering the
> timeframe when the problems happen, maybe we can find something useful
> and at least point you towards the problem, but even then, you may have
> to talk to the ZooKeeper mailing list for real help, and they'll want
> the same logs.
>
> Are you informing Solr about all three of your ZK hosts when you start
> it up? That is a requirement. If the zkHost string you send to Solr
> doesn't list all your servers, then the ZK client inside Solr will not
> be able to fail over correctly. The version of ZK that Solr includes is
> not able to dynamically change the servers that it talks to, and the
> version of ZK that *does* have dynamic reconfiguration is still in
> beta. Solr is not going to include ZK 3.5.x until they put out a stable
> release. I don't know when they're going to do that. It could be soon,
> or it could be several months out. The ZK project does NOT make
> frequent releases.
>
> Thanks,
> Shawn
>
>
--
Jim Keeney
President, FitterWeb
E: j...@fitterweb.com
M: 703-568-5887

*FitterWeb Consulting*
*Are you lean and agile enough?*
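
P.S. On the zkHost point in the quoted reply: below is a minimal Python sketch of a sanity check for a zkHost string, together with a scripted stand-in for the telnet test on port 2181. It is not part of Solr or ZooKeeper. The function names (parse_zk_host, duplicate_servers, port_open) and the host names zk1, zk2, zk3 are hypothetical, and it assumes plain host:port entries (no IPv6 literals) with an optional /chroot suffix.

```python
import socket
from collections import Counter


def parse_zk_host(zk_host):
    """Split a zkHost string into (host, port) pairs, dropping any /chroot suffix."""
    servers, _, _chroot = zk_host.partition("/")
    pairs = []
    for entry in servers.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep:
            # No port given; assume the usual ZooKeeper client port.
            host, port = entry, "2181"
        pairs.append((host, int(port)))
    return pairs


def duplicate_servers(pairs):
    """Return the (host, port) pairs that appear more than once."""
    counts = Counter(pairs)
    return [pair for pair, n in counts.items() if n > 1]


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


# Example use (hypothetical host names, not run automatically):
#   pairs = parse_zk_host("zk1:2181,zk2:2181,zk3:2181")
#   print("duplicates:", duplicate_servers(pairs))
#   for host, port in pairs:
#       print(host, port, "open" if port_open(host, port) else "unreachable")
#       print(host, 3888, "open" if port_open(host, 3888) else "unreachable")
```

A duplicate entry in the list means fewer distinct servers are named than intended, which is the situation the quoted reply warns about. The extra probe of port 3888 in the commented example corresponds to the election address that appears in the ZK server log above.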