Hi.

I have a big issue with my SolrCloud environment.

After developing a solution based on SolrCloud in a safe environment (safe
meaning a stable and fast network connection, a moderate number of documents,
etc.), the solution went into production.

Because we are short on hardware in production, our environment consists of
the following:
1 ZooKeeper instance
2 collections (collection1 and collection2)
3 Solr nodes (node1, node2 and node3)

collection1 is shared between node1 and node2
collection2 is used by node3

On the central machine I have the ZooKeeper instance and the node1 and node3
instances.
A second machine hosts the node2 instance.

Between these 2 machines, the network connection is pretty unstable. From
time to time the connection is lost and node2 becomes unavailable.
Sometimes, after the connection is restored, node2 reconnects to ZooKeeper
successfully, but sometimes it does not.

We tried reloading the node and re-initializing it with a custom procedure,
but nothing worked.
The only solution we have found is to restart node2. When we do this, the
node registers with ZooKeeper again and everything works great for 1 or 2 days.

After intensive debugging and searching Google and the forums, I think the
problem could be in the ZooKeeper configuration.
I think so because I read the ZooKeeper documentation and saw
this:

*syncLimit
(No Java system property)

Amount of time, in ticks (see tickTime), to allow followers to sync with
ZooKeeper. If followers fall too far behind a leader, they will be dropped.*
... so it seems that my Solr node is dropped from ZooKeeper. I don't know if
I've understood this property correctly.

Here is my zoo.cfg:

###### zoo.cfg ######
# The number of milliseconds of each tick
tickTime=1000

# The number of ticks that the initial synchronization phase can take
initLimit=10

# The number of ticks that can pass between 
# sending a request and getting an acknowledgement
syncLimit=5

# the directory where the snapshot is stored.
# Choose appropriately for your environment
dataDir=data/zk1

# the port at which the clients will connect
clientPort=2181

# the directory where transaction log is stored.
# this parameter provides dedicated log device for ZooKeeper
dataLogDir=log/zk1
###### zoo.cfg(end) ######
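
To double-check how I read the tick-based values, here is a minimal Python
sketch that converts them to milliseconds. The function names are my own
(hypothetical), and the session timeout bounds (2 and 20 times tickTime when
not set explicitly) are only my reading of the ZooKeeper admin documentation,
so I may be wrong about whether they apply here:

```python
# Values copied from the zoo.cfg above (comments omitted).
ZOO_CFG = """\
tickTime=1000
initLimit=10
syncLimit=5
dataDir=data/zk1
clientPort=2181
dataLogDir=log/zk1
"""


def parse_zoo_cfg(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def tick_limits_ms(settings):
    """Convert tick-based settings to milliseconds."""
    tick = int(settings["tickTime"])
    return {
        "initLimit_ms": int(settings["initLimit"]) * tick,
        "syncLimit_ms": int(settings["syncLimit"]) * tick,
        # Assumption: defaults documented as 2x and 20x tickTime
        # when the keys are absent from zoo.cfg.
        "minSessionTimeout_ms": int(settings.get("minSessionTimeout", 2 * tick)),
        "maxSessionTimeout_ms": int(settings.get("maxSessionTimeout", 20 * tick)),
    }


limits = tick_limits_ms(parse_zoo_cfg(ZOO_CFG))
for name, value in limits.items():
    print(f"{name} = {value}")
```

If I have this right, syncLimit works out to 5000 ms and initLimit to
10000 ms with my current settings.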



FYI, almost every time the connection is lost, some of the following
errors appear on the node2 machine:

org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer_elect

org.apache.solr.common.SolrException: There was a problem finding the leader
in zk

org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are
disabled.

org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for
/collections/collection1/leaders/shard1
There was a problem finding the leader in zk:java.lang.InterruptedException:
sleep interrupted

org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /overseer_elect/election


I'm sorry for the long post.
Thank you,
Andrei



