Increasing Fault Tolerance of SOLR Cloud and Zookeeper

Stephen Lewis Bianamara Wed, 12 Dec 2018 23:07:20 -0800

Hello SOLR Community!

I have a SOLR cluster which recently hit this error (full error
below): "Cannot talk to ZooKeeper - Updates are disabled." I'm running
SOLR 6.6.2 and ZooKeeper 3.4.6. The first time this happened, we replaced a node within
our cluster. The second time, we followed the advice in this post
<http://lucene.472066.n3.nabble.com/Cannot-talk-to-ZooKeeper-Updates-are-disabled-Solr-6-3-0-td4311582.html>
and just restarted the SOLR service, which resolved the issue. I traced
this down (at least the second time) to this message:

"WARN (zkCallback-4-thread-31-processing-n:<<IP>>:<<PORT>>_solr) [ ]
o.a.s.c.c.ConnectionManager Watcher
org.apache.solr.common.cloud.ConnectionManager@4586a480 name:
ZooKeeperConnection
Watcher:zookeeper-1.dns.domain.foo:1234,zookeeper-2.dns.domain.foo:1234,zookeeper-3.dns.domain.foo:1234
got event WatchedEvent state:Disconnected type:None path:null path: null
type: None"


I'm wondering a few things. First, can you help me understand what this
error means in this context? Did the ZooKeeper nodes themselves experience
an issue, or just the SOLR node trying to talk to them? Only one SOLR node
was affected, but it was the leader, so all writes stopped. Is there any
way to trace this to a specific resource limitation? Our ZK cluster looks
to be at rather low utilization, but perhaps I'm missing something.

Second, what steps can I take to make the SOLR-ZooKeeper interaction
more fault tolerant in general? It seems to me like we might want to (a)
increase the ZooKeeper syncLimit to provide more flexibility within the ZK
quorum, though this would only help if the issue was truly on the ZK side,
or (b) increase the tolerance on the SOLR side of things; would this be
controlled via the zkClientTimeout? Any other thoughts?
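
For context, here is my understanding of how those two settings turn into
wall-clock time, as a rough sketch. It assumes the documented ZooKeeper
defaults (syncLimit is counted in ticks; the server clamps a client's
requested session timeout to between 2x and 20x tickTime unless
minSessionTimeout/maxSessionTimeout are set). The function names and the
numbers below are illustrative only, not our actual configuration.

```python
# Sketch only: how ZooKeeper's tick-based settings translate into milliseconds.
# All numeric values are illustrative assumptions, not taken from our cluster.

def sync_limit_ms(tick_time_ms, sync_limit_ticks):
    """Time a follower may fall behind the leader before it is dropped."""
    return tick_time_ms * sync_limit_ticks


def negotiated_session_timeout_ms(requested_ms, tick_time_ms,
                                  min_ms=None, max_ms=None):
    """Clamp the client's requested timeout (e.g. zkClientTimeout) to the
    range the server allows. Defaults: min = 2 * tickTime, max = 20 * tickTime.
    """
    lo = 2 * tick_time_ms if min_ms is None else min_ms
    hi = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lo, min(requested_ms, hi))


if __name__ == "__main__":
    # tickTime=2000 and syncLimit=5 are example values.
    print(sync_limit_ms(2000, 5))
    # A 60s request is capped at 20 * tickTime on the server side.
    print(negotiated_session_timeout_ms(60000, 2000))
    # A 15s request already falls inside the allowed range.
    print(negotiated_session_timeout_ms(15000, 2000))
```

If that is right, raising zkClientTimeout past 20x tickTime would have no
effect unless maxSessionTimeout is raised on the ZK side as well.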

Third, is there some more fault tolerant ZK connection string than
listing out all three ZK nodes? I *think*, and please correct me if I'm
wrong, this will require all three ZK nodes to be reporting as healthy for
the SOLR node to consider the connection healthy. Is that true? Or maybe
including all three means only a 2/3 quorum need be maintained. If the
connection health is based on quorum, is moving a busy cluster to 5 nodes
for a 3/5 quorum desirable? Any other recommendations to make this
healthier?
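
To be concrete about the 2/3 and 3/5 figures, this is the majority
arithmetic I have in mind, as a small sketch. It only encodes the usual
strict-majority rule for a ZK ensemble; it says nothing about how the SOLR
client judges connection health, which is the part I'm unsure of. The
function names are my own.

```python
# Sketch only: strict-majority quorum arithmetic for an ensemble of n nodes.

def quorum_size(n):
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many nodes can be lost while a majority still remains."""
    return n - quorum_size(n)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

By this arithmetic a 4-node ensemble tolerates no more failures than a
3-node one, which is why I'm asking about 5 rather than 4.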

Fourth, has any of the fault tolerance in this area been improved in later
SOLR/ZooKeeper versions?

Finally, this looks to be connected to this Jira issue
<https://issues.apache.org/jira/browse/SOLR-3274>. The issue doesn't appear
to be very actionable, unfortunately, but it appears people have wondered
about this before. Are there any plans in the works to allow for recovery?
We found our ZK cluster was healthy and restarting the SOLR service fixed
the issue, so auto-recovery on the SOLR side once the ZK cluster returns to
healthy seems like a reasonable feature to add. Would you agree?

Thanks for your help!!
Stephen
