bq. will the leader still report that there were two followers, even if one of them bounced
I really can't say; I took the ZK folks at their word and upgraded. I would
think that restarting your ZK nodes should reestablish that they are all
talking to each other, though you may need to restart your Solr instances to
see it take effect. Sorry I can't be more help.

Erick

On Thu, Dec 13, 2018 at 3:15 PM Stephen Lewis Bianamara
<stephen.bianam...@gmail.com> wrote:
>
> Thanks for the help Erick.
>
> This is an external zookeeper, running on three separate AWS instances
> separate from the instances hosting SOLR. I think I have some more insight
> based on the bug you sent and some more log crawling.
>
> In October we had an instance retirement, wherein the instance was
> automatically stopped and restarted. We verified on that instance that
> echo ruok | nc localhost <<PORT>> returned imok. But I just looked at that
> node with echo mntr | nc localhost <<PORT>>, and it appears to have never
> served a request! The first time I ran it there was 1 packet sent/received,
> the next time 2 of each, the next time three... It's reporting exactly the
> number of times I've run echo mntr | nc localhost <<PORT>> :) The other two
> machines each show millions of packets sent/received. It's quite weird,
> because the leader zookeeper reports 2 synced followers now, yet I wonder
> why the node has never served a request if that's true. Quite bizarre.
>
> The three instances talk over internal DNS. I'm not totally sure if the
> IP of the instance changed after its stop/start; I have seen this both
> change and not change on AWS, and I'm not sure what controls whether a
> stop/start changes the private IP. But I wonder if we can rule anything
> out; in the case of the DNS bug SOLR-12727
> <https://issues.apache.org/jira/browse/SOLR-12727>, will the leader still
> report that there were two followers, even if one of them bounced?
>
> Finally, this log appears on the zookeeper machine and appears to be the
> first sign of trouble: "Unexpected exception causing shutdown while sock
> still open". I'm guessing that what's happened is that our ZK cluster's
> quorum has failed in some way, likely from SOLR-12727, but the leader
> still thinks the other node is a follower. So I wonder, what is the fix
> to this situation? Is it to stop and restart the other two zookeeper
> processes one by one?
>
> Thanks a bunch,
> Stephen
>
> On Thu, Dec 13, 2018 at 8:10 AM Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > "Updates are disabled" means that at least two of your three ZK nodes
> > are unreachable, which is worrisome.
> >
> > First:
> > That error is coming from Solr, but whether it's a Solr issue or a ZK
> > issue is ambiguous. It might be explained if the ZK nodes are under
> > heavy load. Question: is this an external ZK ensemble? If so, what kind
> > of load are those machines under? If you're using the embedded ZK, then
> > stop-the-world GC could cause this.
> >
> > Second:
> > Yeah, increasing timeouts is one of the tricks, but tracking down why
> > the response is so slow would be indicated in either case. I don't
> > have much confidence in this solution in this case, though. Losing
> > quorum indicates something else as the culprit.
> >
> > Third:
> > Not quite. The whole point of specifying the ensemble is that the ZK
> > client is smart enough to continue to function if quorum is present.
> > So it is _not_ the case that all the ZK instances need to be
> > reachable.
> >
> > On that topic, did you bounce your ZK servers or change them in any
> > other way?
> > There's a known ZK issue when you reconfigure live ZK ensembles; see:
> > https://issues.apache.org/jira/browse/SOLR-12727
> >
> > Fourth:
> > See above.
> >
> > HTH,
> > Erick
> >
> > On Wed, Dec 12, 2018 at 11:06 PM Stephen Lewis Bianamara
> > <stephen.bianam...@gmail.com> wrote:
> > >
> > > Hello SOLR Community!
> > >
> > > I have a SOLR cluster which recently hit this error (full error
> > > below): "Cannot talk to ZooKeeper - Updates are disabled." I'm running
> > > Solr 6.6.2 and ZooKeeper 3.4.6. The first time this happened, we
> > > replaced a node within our cluster. The second time, we followed the
> > > advice in this post
> > > <http://lucene.472066.n3.nabble.com/Cannot-talk-to-ZooKeeper-Updates-are-disabled-Solr-6-3-0-td4311582.html>
> > > and just restarted the SOLR service, which resolved the issue. I
> > > traced this down (at least the second time) to this message: "WARN
> > > (zkCallback-4-thread-31-processing-n:<<IP>>:<<PORT>>_solr) [ ]
> > > o.a.s.c.c.ConnectionManager Watcher
> > > org.apache.solr.common.cloud.ConnectionManager@4586a480 name:
> > > ZooKeeperConnection Watcher:zookeeper-1.dns.domain.foo:1234,
> > > zookeeper-2.dns.domain.foo:1234,zookeeper-3.dns.domain.foo:1234
> > > got event WatchedEvent state:Disconnected type:None path:null
> > > path: null type: None".
> > >
> > > I'm wondering a few things. First, can you help me understand what
> > > this error means in this context? Did the ZooKeepers themselves
> > > experience an issue, or just the SOLR node trying to talk to them?
> > > There was only one SOLR node affected, which was the leader, and it
> > > thus stopped all writes. Is there any way to trace this to a specific
> > > resource limitation? Our ZK cluster looks to be rather low
> > > utilization, but perhaps I'm missing something.
> > >
> > > Second, what steps can I take to make the SOLR-ZooKeeper interaction
> > > more fault tolerant in general? It seems to me like we might want to
> > > (a) increase the ZooKeeper syncLimit to provide more flexibility
> > > within the ZK quorum, but this would only help if the issue was truly
> > > on the ZK side. We could also increase the tolerance on the SOLR side
> > > of things; would this be controlled via the zkClientTimeout? Any
> > > other thoughts?
> > >
> > > Third, is there some more fault-tolerant ZK connection string than
> > > listing out all three ZK nodes? I *think*, and please correct me if
> > > I'm wrong, that this requires all three ZK nodes to be reporting as
> > > healthy for the SOLR node to consider the connection healthy. Is that
> > > true? Or maybe including all three means only a 2/3 quorum need be
> > > maintained. If the connection health is based on quorum, is moving a
> > > busy cluster to 5 nodes for a 3/5 quorum desirable? Any other
> > > recommendations to make this healthier?
> > >
> > > Fourth, is any of the fault tolerance in this area improved in later
> > > SOLR/ZooKeeper versions?
> > >
> > > Finally, this looks to be connected to this Jira issue
> > > <https://issues.apache.org/jira/browse/SOLR-3274>. The issue doesn't
> > > appear to be very actionable unfortunately, but it appears people
> > > have wondered about this before. Are there any plans in the works to
> > > allow for recovery? We found our ZK cluster was healthy and
> > > restarting the Solr service fixed the issue, so it seems a reasonable
> > > feature to add auto-recovery on the SOLR side when the ZK cluster
> > > returns to healthy. Would you agree?
> > >
> > > Thanks for your help!!
> > > Stephen
> >
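For anyone landing on this thread later: below is a minimal sketch of the
ensemble-wide check described above, using the standard ZooKeeper 3.4.x
four-letter-word commands (ruok, stat, mntr). The hostnames are the ones from
the connection string in the log message; the client port is a placeholder,
substitute your own.

    ZK_PORT=2181   # placeholder -- use your ensemble's actual client port
    for zk in zookeeper-1.dns.domain.foo zookeeper-2.dns.domain.foo zookeeper-3.dns.domain.foo; do
      echo "== $zk =="
      echo ruok | nc "$zk" "$ZK_PORT"; echo           # a healthy server answers "imok"
      echo stat | nc "$zk" "$ZK_PORT" | grep Mode     # expect exactly one leader, two followers
      echo mntr | nc "$zk" "$ZK_PORT" | grep -E 'zk_server_state|zk_synced_followers|zk_packets'
    done

On a healthy three-node ensemble, mntr on the leader should report
zk_synced_followers 2, and packet counts on all three nodes should grow with
real client traffic rather than only with your own probes.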
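And since the thread asks where the timeout and quorum knobs live, here is a
rough sketch of the relevant settings, assuming a stock Solr 6.x solr.in.sh
and ZooKeeper 3.4.x zoo.cfg. The values and the 2888/3888 peer/election ports
are the conventional defaults, shown for illustration only, not tuning advice
for this particular cluster.

    # solr.in.sh -- the ensemble string lists all three nodes; the client only
    # needs to reach one server belonging to a live quorum, not all three.
    ZK_HOST="zookeeper-1.dns.domain.foo:2181,zookeeper-2.dns.domain.foo:2181,zookeeper-3.dns.domain.foo:2181"
    ZK_CLIENT_TIMEOUT="30000"   # zkClientTimeout in ms (also settable in solr.xml)

    # zoo.cfg on each ZooKeeper node
    tickTime=2000    # ms per tick
    initLimit=10     # ticks a follower may take for its initial sync with the leader
    syncLimit=5      # ticks a follower may lag before the leader drops it
    server.1=zookeeper-1.dns.domain.foo:2888:3888
    server.2=zookeeper-2.dns.domain.foo:2888:3888
    server.3=zookeeper-3.dns.domain.foo:2888:3888

A three-node ensemble keeps quorum with one node down; going to five tolerates
two failures, at the cost of slightly higher write latency since a majority
(3 of 5) must acknowledge each write.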