Shawn -

First, it's good to know that this is unusual behavior. That actually helps
as it lets me know that I should keep digging.

Here are a couple of things that might help.

In the configuration I am calling out all three ZK nodes. Here is the
configuration of Solr:

-DSTOP.KEY=solrrocks
-DSTOP.PORT=7983
-Dhost=solr2
-Djetty.home=/opt/solr/server
-Djetty.port=8983
-Dlog4j.configuration=file:/data/solr/log4j.properties
-Dsolr.install.dir=/opt/solr
-Dsolr.log.dir=/data/solr/logs
-Dsolr.log.muteconsole
-Dsolr.solr.home=/data/solr/data
-Duser.timezone=UTC
-DzkClientTimeout=15000
-DzkHost=<ZK Host internal IP 1>:2181,<ZK Host internal IP 2>:2181,<ZK Host internal IP 1>:2181
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:+UseGCLogFileRotation
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000
-XX:ConcGCThreads=4
-XX:GCLogFileSize=20M
-XX:MaxTenuringThreshold=8
-XX:NewRatio=3
-XX:NumberOfGCLogFiles=9
-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /data/solr/logs
-XX:ParallelGCThreads=4
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90
-Xloggc:/data/solr/logs/solr_gc.log
-Xms2G
-Xmx6G
-Xss1024k
-Xss256k
-verbose:gc
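
Since the zkHost value is the part that matters most here, this is roughly how I would sanity-check it. It is only a sketch: the function names are my own, and the sample addresses are made-up placeholders, not the real IPs. It splits the string into host/port pairs (ignoring an optional chroot suffix) and reports any entry listed more than once. As pasted above, the "IP 1" placeholder appears twice, which may just be a slip in this email, but it is the kind of thing this would catch. IPv6 literals are not handled.

```python
from collections import Counter

def parse_zk_host(zk_host):
    """Split a zkHost string into (host, port) pairs, ignoring an optional chroot."""
    # A chroot such as "/solr" may follow the last host:port entry.
    hosts_part, _, _chroot = zk_host.partition("/")
    pairs = []
    for entry in hosts_part.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep:
            # No explicit port given: fall back to the usual client port.
            host, port = entry, "2181"
        pairs.append((host, int(port)))
    return pairs

def duplicate_entries(pairs):
    """Return the (host, port) pairs that occur more than once."""
    counts = Counter(pairs)
    return [pair for pair, n in counts.items() if n > 1]

# Placeholder addresses standing in for the three internal IPs.
zk_host = "10.0.0.1:2181,10.0.0.2:2181,10.0.0.1:2181"
pairs = parse_zk_host(zk_host)
print(duplicate_entries(pairs))  # prints [('10.0.0.1', 2181)]
```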


Here are the types of Solr errors I receive when this happens. Using telnet
to connect to port 2181 on the ZK nodes, I was able to rule out a security
problem.
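
The telnet check can also be scripted so it is easy to repeat across all three nodes. This is a minimal sketch, not something from my setup: the function names are my own, and the ports are simply the two that show up in this thread (2181 for clients, 3888 for elections).

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and name resolution failures.
        return False

def check_nodes(hosts, ports=(2181, 3888)):
    """Map each (host, port) pair to whether it accepted a connection."""
    return {(h, p): port_open(h, p) for h in hosts for p in ports}
```

Calling check_nodes() with the three ZK addresses from each Solr host and from each ZK host gives a quick picture of which direction is being refused. A False here only says the connection did not succeed. It does not say whether a firewall or a non-listening process is the cause.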

2018-02-26 19:58:50.964 WARN  (main-SendThread(<internal IP>:2181)) [   ] o.a.z.ClientCnxn Session 0x361d3ae3f1c0000 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)

2018-02-26 19:58:52.894 WARN  (main-SendThread(<internal IP>:2181)) [   ] o.a.z.ClientCnxn Session 0x361d3ae3f1c0000 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)

2018-02-26 19:58:53.456 WARN  (main-SendThread(<internal IP>:2181)) [   ] o.a.z.ClientCnxn Session 0x361d3ae3f1c0000 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)


And here are the errors when the ZK nodes are not able to connect to each
other.


2018-02-26 19:57:25,554 [myid:2] - WARN [WorkerSender[myid=2]:QuorumCnxManager@588] - Cannot open channel to 1 at election address /<internal IP>:3888
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:538)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:452)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:433)
    at java.lang.Thread.run(Thread.java:748)

2018-02-26 19:57:25,554 [myid:2] - INFO [WorkerSender[myid=2]:QuorumPeer$QuorumServer@167] - Resolved hostname: <internal IP> to address: /<internal IP>

2018-02-26 19:57:25,554 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 2 (n.leader), 0xa00000013 (n.zxid), 0x4 (n.round), LOOKING (n.state), 2 (n.sid), 0xa (n.peerEpoch) LOOKING (my state)

2018-02-26 19:57:25,556 [myid:2] - WARN [WorkerSender[myid=2]:QuorumCnxManager@588] - Cannot open channel to 3 at election address /<internal IP>:3888
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:538)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:452)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:433)
    at java.lang.Thread.run(Thread.java:748)

2018-02-26 19:57:25,556 [myid:2] - INFO [WorkerSender[myid=2]:QuorumPeer$QuorumServer@167] - Resolved hostname: <internal IP> to address: /<internal IP>

2018-02-26 19:57:25,756 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@588] - Cannot open channel to 1 at election address /<internal IP>:3888
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:614)
    at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:843)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:913)

2018-02-26 19:57:25,757 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@167] - Resolved hostname: <internal IP> to address: /<internal IP>

2018-02-26 19:57:25,812 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@588] - Cannot open channel to 3 at election address /<internal IP>:3888
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:614)
    at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:843)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:913)
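
In case it helps narrow this down, the role each ZK node believes it has can be read with the "srvr" four-letter command on the client port. Below is a rough sketch, assuming that command is enabled on the servers. The function names are my own invention, not part of any ZK client library.

```python
import socket

def four_letter_word(host, port=2181, command="srvr", timeout=3.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            # The server closes the connection once the reply is complete.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

def parse_mode(reply):
    """Extract the value of the 'Mode:' line (e.g. leader or follower), or None."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None
```

Running parse_mode(four_letter_word(host)) against each of the three nodes should show one leader and two followers when the ensemble is healthy. A None result means the reply had no Mode line, which is what I would expect from a node that is still in the LOOKING state shown in the log above.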


Let me know if you need anything else or if I should take my request over
to ZooKeeper.


Thanks.

Jim K.



On Tue, Feb 27, 2018 at 8:19 PM Shawn Heisey <apa...@elyograg.org> wrote:

> On 2/27/2018 10:57 AM, James Keeney wrote:
> > *1 - ZK ensemble not accepting return of node*
> > Currently, when a ZK node in the ensemble goes down the ensemble is able
> > to do what it should do and keeps working. However when I bring the 3rd
> > node back online the other two nodes reject connection requests from the
> > 3rd node until I restart the nodes. The sequence is:
> >
> >    1. Bring 3rd node back on line
> >    2. Restart follower in existing ensemble
> >    3. Restart leader in existing ensemble
> >
> > When this is done the third node happily becomes part of the ensemble.
>
> From what I understand, restarting the other nodes should not be
> required.  If everything is configured properly, I don't think that
> should be happening, but I don't have deep ZK knowledge.
>
> > *2 - Solr nodes unable to connect*
> > When setting up the cluster for the first time the ensemble rejects the
> > solr connection requests until the ZK on the ZK ensemble members is
> > restarted.
>
> <snip>
>
> > However, we have also seen that if we have a problem with one of the Solr
> > nodes that requires restarting more than one node we have to restart ZK
> > to reconnect the nodes with the ensemble again.
>
> These problems sound very weird too.  I wish I had some idea, but
> without logs showing what kind of errors are encountered, I have no idea
> what's happening.
>
> None of these problems are in Solr code.  Solr uses the ZooKeeper client
> code without modification.  All the ZK communication is done in ZK code,
> initialized with the zkHost string and a few other config bits (like
> zkClientTimeout) provided to Solr at startup.
>
> If you want to share the Solr log and the ZK server logs covering the
> timeframe when the problems happen, maybe we can find something useful
> and at least point you towards the problem, but even then, you may have
> to talk to the ZooKeeper mailing list for real help, and they'll want
> the same logs.
>
> Are you informing Solr about all three of your ZK hosts when you start
> it up?  That is a requirement.  If the zkHost string you send to Solr
> doesn't list all your servers, then the ZK client inside Solr will not
> be able to fail over correctly.  The version of ZK that Solr includes is
> not able to dynamically change the servers that it talks to, and the
> version of ZK that *does* have dynamic reconfiguration is still in
> beta.  Solr is not going to include ZK 3.5.x until they put out a stable
> release.  I don't know when they're going to do that.  It could be soon,
> or it could be several months out.  The ZK project does NOT make
> frequent releases.
>
> Thanks,
> Shawn
>
> --
Jim Keeney
President, FitterWeb
E: j...@fitterweb.com
M: 703-568-5887

*FitterWeb Consulting*
*Are you lean and agile enough? *
