The only substantive change to the _code_ was changing these lines:

    permission javax.security.auth.kerberos.ServicePermission "zookeeper/127.0....@example.com", "initiate";
    permission javax.security.auth.kerberos.ServicePermission "zookeeper/127.0....@example.com", "accept";

to

    permission javax.security.auth.kerberos.ServicePermission "zookeeper/localh...@example.com", "initiate";
    permission javax.security.auth.kerberos.ServicePermission "zookeeper/localh...@example.com", "accept";
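For anyone who hasn't edited a Java policy file before: such permission lines sit inside a grant block. A minimal sketch, with a hypothetical placeholder principal rather than the actual values from the Solr test policy file:

    grant {
      // "zookeeper/localhost@EXAMPLE.COM" is a placeholder principal, for illustration only
      permission javax.security.auth.kerberos.ServicePermission "zookeeper/localhost@EXAMPLE.COM", "initiate";
      permission javax.security.auth.kerberos.ServicePermission "zookeeper/localhost@EXAMPLE.COM", "accept";
    };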
Again this was in our test framework, the "solr-tests.policy" file. If you use Kerberos, you probably know a lot more about why you'd need to do that than I do, and if you don't use Kerberos you probably don't care.

When I say "substantive", I mean that the ZooKeeper progression from Solr 6.6 was 3.4.10 -> 3.4.11 -> 3.4.13. The associated JIRAs are SOLR-11658 and SOLR-12727. 12727 has a discussion of why the above change is necessary, with links to the associated ZK JIRA.

So this looks like it'd be fine, with the usual caveat that nobody that I know of has tested using ZK 3.4.13 with Solr 6.6......

Best,
Erick

On Fri, Dec 14, 2018 at 10:01 AM Stephen Lewis Bianamara <stephen.bianam...@gmail.com> wrote:

> Thanks Erick, you've been very helpful. One other question I have: is it reasonable to upgrade zookeeper on an in-place SOLR? I see that 12727 appears to be verified with SOLR 7 modulo some test issues. For SOLR 6.6, would upgrading zookeeper to this version be advisable, or would you say that it would be risky? Of course I'll stage in a test environment, but it's hard to get the full story from just that...

> Thanks!

> On Thu, Dec 13, 2018 at 7:09 PM Erick Erickson <erickerick...@gmail.com> wrote:

> > bq. will the leader still report that there were two followers, even if one of them bounced

> > I really can't say, I took the ZK folks at their word and upgraded.

> > I should think that restarting your ZK nodes should reestablish that they are all talking to each other; you may need to restart your Solr instances to see it take effect.

> > Sorry I can't be more help
> > Erick

> > On Thu, Dec 13, 2018 at 3:15 PM Stephen Lewis Bianamara <stephen.bianam...@gmail.com> wrote:

> > > Thanks for the help Erick.

> > > This is an external zookeeper, running on three separate AWS instances separate from the instances hosting SOLR. I think I have some more insight based on the bug you sent and some more log crawling.

> > > In October we had an instance retirement, wherein the instance was automatically stopped and restarted. We verified on that instance that echo ruok | nc localhost <<PORT>> returned imok. But I just looked at that node with echo mntr | nc localhost <<PORT>>, and it appears to have never served a request! The first time I ran it there was 1 packet sent/received, the next time 2 of each, the next time three.... It's reporting exactly the number of times I run echo mntr | nc localhost <<PORT>> :) The other two machines each show millions of packets sent/received. It's quite weird because the leader zookeeper reports 2 synced followers now, yet I wonder why the node has never served a request if that's true. Quite bizarre.

> > > The three instances talk over internal DNS; I'm not totally sure if the IP of the instance changed after its stop/start. I have seen this both change and not change on AWS, and I'm not sure what controls whether a stop/start changes the private IP. But I wonder if we can rule anything out; in the case of the DNS bug 12727 <https://issues.apache.org/jira/browse/SOLR-12727>, will the leader still report that there were two followers, even if one of them bounced?

> > > Finally, this log line appears on the zookeeper machine and appears to be the first sign of trouble: "Unexpected exception causing shutdown while sock still open."
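For reference, the four-letter-word checks mentioned above look roughly like this; the hostname and port are placeholders taken from the WARN message elsewhere in the thread, and the mntr field names are from ZooKeeper 3.4, so treat the exact output as approximate:

    # liveness: a healthy server answers "imok"
    echo ruok | nc zookeeper-1.dns.domain.foo 1234

    # monitoring counters: packet counts, server role, and (on the leader) follower sync state
    echo mntr | nc zookeeper-1.dns.domain.foo 1234
    #   zk_packets_received / zk_packets_sent   should keep growing on any node serving clients
    #   zk_server_state                         leader or follower
    #   zk_synced_followers                     reported by the leader only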
> > > I'm guessing that what's happened is that our zk cluster has a failed quorum in some way, likely from 12727, but the leader still thinks the other node is a follower. So I wonder, what is the fix to this situation? Is it to one-by-one stop and restart the other two zookeeper processes?

> > > Thanks a bunch,
> > > Stephen

> > > On Thu, Dec 13, 2018 at 8:10 AM Erick Erickson <erickerick...@gmail.com> wrote:

> > > > Updates are disabled means that at least two of your three ZK nodes are unreachable, which is worrisome.

> > > > First: That error is coming from Solr, but whether it's a Solr issue or a ZK issue is ambiguous. It might be explained if the ZK nodes are under heavy load. Question: is this an external ZK ensemble? If so, what kind of load are those machines under? If you're using the embedded ZK, then stop-the-world GC could cause this.

> > > > Second: Yeah, increasing timeouts is one of the tricks, but tracking down why the response is so slow would be indicated in either case. I don't have much confidence in this solution in this case, though. Losing quorum indicates something else as the culprit.

> > > > Third: Not quite. The whole point of specifying the ensemble is that the ZK client is smart enough to continue to function if quorum is present. So it is _not_ the case that all the ZK instances need to be reachable.

> > > > On that topic, did you bounce your ZK servers or change them in any other way? There's a known ZK issue when you reconfigure live ZK ensembles, see: https://issues.apache.org/jira/browse/SOLR-12727

> > > > Fourth: See above.

> > > > HTH,
> > > > Erick

> > > > On Wed, Dec 12, 2018 at 11:06 PM Stephen Lewis Bianamara <stephen.bianam...@gmail.com> wrote:

> > > > > Hello SOLR Community!

> > > > > I have a SOLR cluster which recently hit this error (full error below): "Cannot talk to ZooKeeper - Updates are disabled." I'm running solr 6.6.2 and zookeeper 3.4.6. The first time this happened, we replaced a node within our cluster. The second time, we followed the advice in this post <http://lucene.472066.n3.nabble.com/Cannot-talk-to-ZooKeeper-Updates-are-disabled-Solr-6-3-0-td4311582.html> and just restarted the SOLR service, which resolved the issue. I traced this down (at least the second time) to this message:

> > > > > "WARN (zkCallback-4-thread-31-processing-n:<<IP>>:<<PORT>>_solr) [ ] o.a.s.c.c.ConnectionManager Watcher org.apache.solr.common.cloud.ConnectionManager@4586a480 name: ZooKeeperConnection Watcher:zookeeper-1.dns.domain.foo:1234,zookeeper-2.dns.domain.foo:1234,zookeeper-3.dns.domain.foo:1234 got event WatchedEvent state:Disconnected type:None path:null path: null type: None"

> > > > > I'm wondering a few things. First, can you help me understand what this error means in this context? Did the Zookeepers themselves experience an issue, or just the SOLR node trying to talk to the zookeepers? There was only one SOLR node affected, which was the leader, and thus stopped all writes. Any way to trace this to a specific resource limitation? Our ZK cluster looks to be rather low utilization, but perhaps I'm missing something.
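One cheap check on the "resource limitation" angle, from the Solr side, is to look for long stop-the-world pauses in the JVM GC log around the time of the disconnect. A sketch only: it assumes the stock Solr 6.x start scripts (which enable -XX:+PrintGCApplicationStoppedTime and write to solr_gc.log) and the service-installer log location, so adjust the path to your install:

    # multi-second pauses here can expire the ZooKeeper session and produce the Disconnected event
    grep "Total time for which application threads were stopped" /var/solr/logs/solr_gc.log | tail -50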
> > > > > The second, what steps can I take to make the SOLR-zookeeper interaction more fault tolerant in general? It seems to me like we might want to (a) increase the Zookeeper syncLimit to provide more flexibility within the ZK quorum, but this would only help if the issue was truly on the zk side. We could also increase the tolerance on the SOLR side of things; would this be controlled via the zkClientTimeout? Any other thoughts?

> > > > > The third, is there some more fault tolerant ZK connection string than listing out all three ZK nodes? I *think*, and please correct me if I'm wrong, that this will require all three ZK nodes to be reporting as healthy for the SOLR node to consider the connection healthy. Is that true? Maybe including all three does mean only a 2/3 quorum need be maintained. If the connection health is based on quorum, is moving a busy cluster to 5 nodes for a 3/5 quorum desirable? Any other recommendations to make this healthier?

> > > > > Fourth, is any of the fault tolerance in this area improved in later SOLR/Zookeeper versions?

> > > > > Finally, this looks to be connected to this Jira issue <https://issues.apache.org/jira/browse/SOLR-3274>? The issue doesn't appear to be very actionable unfortunately, but it appears people have wondered about this before. Are there any plans in the works to allow for recovery? We found our ZK cluster was healthy and restarting the solr service fixed the issue, so it seems a reasonable feature to add auto-recovery on the SOLR side when the ZK cluster returns to healthy. Would you agree?

> > > > > Thanks for your help!!
> > > > > Stephen
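For anyone following along, the knobs discussed above live in two files. A minimal sketch using the ensemble hostnames from the WARN message; the numbers are illustrative rather than recommendations, and ZK_CLIENT_TIMEOUT assumes the stock bin/solr scripts, which pass it through as -DzkClientTimeout:

    # zoo.cfg on each ZooKeeper node. syncLimit is measured in ticks:
    # tickTime=2000 with syncLimit=7 gives a follower roughly 14s to sync with the leader.
    tickTime=2000
    initLimit=10
    syncLimit=7
    server.1=zookeeper-1.dns.domain.foo:2888:3888
    server.2=zookeeper-2.dns.domain.foo:2888:3888
    server.3=zookeeper-3.dns.domain.foo:2888:3888

    # solr.in.sh on each Solr node. Listing the whole ensemble means the client
    # stays functional as long as a quorum (2 of 3) is reachable, not all three nodes.
    ZK_HOST="zookeeper-1.dns.domain.foo:1234,zookeeper-2.dns.domain.foo:1234,zookeeper-3.dns.domain.foo:1234"
    ZK_CLIENT_TIMEOUT=30000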