"Updates are disabled" means that at least two of your three ZK nodes are unreachable, which is worrisome.
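
A quick way to confirm is to ask each ZK node for its status directly
with the "srvr" four-letter command (a sketch; the hostnames and port
are assumed from the connection string in the log you quote below):

    # Each healthy ensemble member reports "Mode: leader" or
    # "Mode: follower"; fewer than two of three responding means
    # quorum is lost and Solr disables updates.
    for h in zookeeper-1.dns.domain.foo \
             zookeeper-2.dns.domain.foo \
             zookeeper-3.dns.domain.foo; do
      echo "== $h =="
      echo srvr | nc "$h" 1234
    done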
First: That error is coming from Solr, but whether it's a Solr issue
or a ZK issue is ambiguous. It might be explained by the ZK nodes
being under heavy load. Question: is this an external ZK ensemble? If
so, what kind of load are those machines under? If you're using the
embedded ZK, a stop-the-world GC pause could cause this.

Second: Yes, increasing timeouts is one of the tricks (that's
zkClientTimeout on the Solr side; see the P.S. below your quoted
message), but tracking down why the response is so slow is indicated
in either case. I don't have much confidence in that solution here,
though; losing quorum points to something else as the culprit.

Third: Not quite. The whole point of specifying the full ensemble is
that the ZK client is smart enough to keep functioning as long as a
quorum is present, so it is _not_ the case that all the ZK instances
need to be reachable (again, see the P.S. below). On that topic, did
you bounce your ZK servers or change them in any other way? There's a
known issue when you reconfigure live ZK ensembles, see:
https://issues.apache.org/jira/browse/SOLR-12727

Fourth: See above.

HTH,
Erick

On Wed, Dec 12, 2018 at 11:06 PM Stephen Lewis Bianamara
<stephen.bianam...@gmail.com> wrote:
>
> Hello Solr Community!
>
> I have a Solr cluster which recently hit this error (full error
> below): "Cannot talk to ZooKeeper - Updates are disabled." I'm
> running Solr 6.6.2 and ZooKeeper 3.4.6. The first time this happened,
> we replaced a node within our cluster. The second time, we followed
> the advice in this post
> <http://lucene.472066.n3.nabble.com/Cannot-talk-to-ZooKeeper-Updates-are-disabled-Solr-6-3-0-td4311582.html>
> and just restarted the Solr service, which resolved the issue. I
> traced this down (at least the second time) to this message: "WARN
> (zkCallback-4-thread-31-processing-n:<<IP>>:<<PORT>>_solr) [ ]
> o.a.s.c.c.ConnectionManager Watcher
> org.apache.solr.common.cloud.ConnectionManager@4586a480 name:
> ZooKeeperConnection Watcher:zookeeper-1.dns.domain.foo:1234,
> zookeeper-2.dns.domain.foo:1234,zookeeper-3.dns.domain.foo:1234 got
> event WatchedEvent state:Disconnected type:None path:null path: null
> type: None"
>
> I'm wondering a few things. First, can you help me understand what
> this error means in this context? Did the ZooKeepers themselves
> experience an issue, or just the Solr node trying to talk to them?
> Only one Solr node was affected, and it was the leader, so all writes
> stopped. Is there any way to trace this to a specific resource
> limitation? Our ZK cluster looks to be rather low utilization, but
> perhaps I'm missing something.
>
> Second, what steps can I take to make the Solr-ZooKeeper interaction
> more fault tolerant in general? It seems to me like we might want to
> (a) increase the ZooKeeper syncLimit to provide more flexibility
> within the ZK quorum, but this would only help if the issue was truly
> on the ZK side. We could also (b) increase the tolerance on the Solr
> side of things; would this be controlled via zkClientTimeout? Any
> other thoughts?
>
> Third, is there some more fault tolerant ZK connection string than
> listing out all three ZK nodes? I *think*, and please correct me if
> I'm wrong, that this requires all three ZK nodes to be reporting as
> healthy for the Solr node to consider the connection healthy. Is that
> true? Or does including all three mean that only a 2/3 quorum need be
> maintained? If the connection health is based on quorum, is moving a
> busy cluster to 5 nodes for a 3/5 quorum desirable? Any other
> recommendations to make this healthier?
>
> Fourth, is any of the fault tolerance in this area improved in later
> Solr/ZooKeeper versions?
>
> Finally, this looks to be connected to this JIRA issue:
> <https://issues.apache.org/jira/browse/SOLR-3274>. The issue doesn't
> appear to be very actionable, unfortunately, but it does appear that
> people have wondered about this before. Are there any plans in the
> works to allow for recovery? We found our ZK cluster was healthy and
> restarting the Solr service fixed the issue, so auto-recovery on the
> Solr side when the ZK cluster returns to healthy seems like a
> reasonable feature to add. Would you agree?
>
> Thanks for your help!!
> Stephen
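
P.S. Since you asked where the Solr-side knobs live, roughly this (a
sketch; the hostnames/port come from your log, and the timeout value
shown is just the solr.in.sh template default, not a recommendation):

    # solr.in.sh -- list the full ensemble; the client only needs a
    # quorum of these reachable, not all three.
    ZK_HOST="zookeeper-1.dns.domain.foo:1234,zookeeper-2.dns.domain.foo:1234,zookeeper-3.dns.domain.foo:1234"

    # ZK session timeout in ms (maps to zkClientTimeout). Raising this
    # is the "increase timeouts" trick from my Second point.
    ZK_CLIENT_TIMEOUT="15000"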
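
And on the ZK side, syncLimit lives in zoo.cfg (the values below are
the stock sample defaults, shown only for orientation):

    # zoo.cfg
    tickTime=2000   # basic time unit in ms
    initLimit=10    # ticks a follower may take for its initial sync
    syncLimit=5     # ticks a follower may lag the leader before it is
                    # dropped from the quorum

As for 3 vs. 5 nodes: a 3-node ensemble tolerates one failure (quorum
2/3) while a 5-node ensemble tolerates two (quorum 3/5), at the cost
of somewhat slower ZK writes since more nodes must acknowledge each
one.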