The only substantive change to the _code_ was changing these lines:

    permission javax.security.auth.kerberos.ServicePermission "zookeeper/127.0....@example.com", "initiate";
    permission javax.security.auth.kerberos.ServicePermission "zookeeper/127.0....@example.com", "accept";

to

    permission javax.security.auth.kerberos.ServicePermission "zookeeper/localh...@example.com", "initiate";
    permission javax.security.auth.kerberos.ServicePermission "zookeeper/localh...@example.com", "accept";
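For anyone who hasn't edited a Java policy file before: such permission lines sit inside a grant block. A minimal sketch, with a hypothetical placeholder principal rather than the actual values from the Solr test policy file:

    grant {
      // "zookeeper/localhost@EXAMPLE.COM" is a placeholder principal, for illustration only
      permission javax.security.auth.kerberos.ServicePermission "zookeeper/localhost@EXAMPLE.COM", "initiate";
      permission javax.security.auth.kerberos.ServicePermission "zookeeper/localhost@EXAMPLE.COM", "accept";
    };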
Again this was in our test framework, the "solr-tests.policy" file. If you use Kerberos, you probably know a lot more about why you'd need to do that than I do, and if you don't use Kerberos you probably don't care.

When I say "substantive", I mean that the ZooKeeper progression from Solr 6.6 was 3.4.10 -> 3.4.11 -> 3.4.13. The associated JIRAs are SOLR-11658 and SOLR-12727. 12727 has a discussion of why the above change is necessary, with links to the associated ZK JIRA.

So this looks like it'd be fine, with the usual caveat that nobody that I know of has tested using ZK 3.4.13 with Solr 6.6......

Best,
Erick

On Fri, Dec 14, 2018 at 10:01 AM Stephen Lewis Bianamara <stephen.bianam...@gmail.com> wrote:

> Thanks Erick, you've been very helpful. One other question I have: is it reasonable to upgrade zookeeper on an in-place SOLR? I see that 12727 appears to be verified with SOLR 7 modulo some test issues. For SOLR 6.6, would upgrading zookeeper to this version be advisable, or would you say that it would be risky? Of course I'll stage in a test environment, but it's hard to get the full story from just that...

> Thanks!

> On Thu, Dec 13, 2018 at 7:09 PM Erick Erickson <erickerick...@gmail.com> wrote:

> > bq. will the leader still report that there were two followers, even if one of them bounced

> > I really can't say, I took the ZK folks at their word and upgraded.

> > I should think that restarting your ZK nodes should reestablish that they are all talking to each other; you may need to restart your Solr instances to see it take effect.

> > Sorry I can't be more help
> > Erick

> > On Thu, Dec 13, 2018 at 3:15 PM Stephen Lewis Bianamara <stephen.bianam...@gmail.com> wrote:

> > > Thanks for the help Erick.

> > > This is an external zookeeper, running on three separate AWS instances separate from the instances hosting SOLR. I think I have some more insight based on the bug you sent and some more log crawling.

> > > In October we had an instance retirement, wherein the instance was automatically stopped and restarted. We verified on that instance that echo ruok | nc localhost <<PORT>> returned imok. But I just looked at that node with echo mntr | nc localhost <<PORT>>, and it appears to have never served a request! The first time I ran it there was 1 packet sent/received, the next time 2 of each, the next time three.... It's reporting exactly the number of times I run echo mntr | nc localhost <<PORT>> :) The other two machines each show millions of packets sent/received. It's quite weird because the leader zookeeper reports 2 synced followers now, yet I wonder why the node has never served a request if that's true. Quite bizarre.

> > > The three instances talk over internal DNS; I'm not totally sure if the IP of the instance changed after its stop/start. I have seen this both change and not change on AWS, and I'm not sure what controls whether a stop/start changes the private IP. But I wonder if we can rule anything out; in the case of the DNS bug 12727 <https://issues.apache.org/jira/browse/SOLR-12727>, will the leader still report that there were two followers, even if one of them bounced?

> > > Finally, this log line appears on the zookeeper machine and appears to be the first sign of trouble: "Unexpected exception causing shutdown while sock still open."
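For reference, the four-letter-word checks mentioned above look roughly like this; the hostname and port are placeholders taken from the WARN message elsewhere in the thread, and the mntr field names are from ZooKeeper 3.4, so treat the exact output as approximate:

    # liveness: a healthy server answers "imok"
    echo ruok | nc zookeeper-1.dns.domain.foo 1234

    # monitoring counters: packet counts, server role, and (on the leader) follower sync state
    echo mntr | nc zookeeper-1.dns.domain.foo 1234
    #   zk_packets_received / zk_packets_sent   should keep growing on any node serving clients
    #   zk_server_state                         leader or follower
    #   zk_synced_followers                     reported by the leader only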
> > > I'm guessing that what's happened is that our zk cluster has a failed quorum in some way, likely from 12727, but the leader still thinks the other node is a follower. So I wonder, what is the fix to this situation? Is it to one-by-one stop and restart the other two zookeeper processes?

> > > Thanks a bunch,
> > > Stephen

> > > On Thu, Dec 13, 2018 at 8:10 AM Erick Erickson <erickerick...@gmail.com> wrote:

> > > > Updates are disabled means that at least two of your three ZK nodes are unreachable, which is worrisome.

> > > > First: That error is coming from Solr, but whether it's a Solr issue or a ZK issue is ambiguous. It might be explained if the ZK nodes are under heavy load. Question: is this an external ZK ensemble? If so, what kind of load are those machines under? If you're using the embedded ZK, then stop-the-world GC could cause this.

> > > > Second: Yeah, increasing timeouts is one of the tricks, but tracking down why the response is so slow would be indicated in either case. I don't have much confidence in this solution in this case, though. Losing quorum indicates something else as the culprit.

> > > > Third: Not quite. The whole point of specifying the ensemble is that the ZK client is smart enough to continue to function if quorum is present. So it is _not_ the case that all the ZK instances need to be reachable.

> > > > On that topic, did you bounce your ZK servers or change them in any other way? There's a known ZK issue when you reconfigure live ZK ensembles, see: https://issues.apache.org/jira/browse/SOLR-12727

> > > > Fourth: See above.

> > > > HTH,
> > > > Erick

> > > > On Wed, Dec 12, 2018 at 11:06 PM Stephen Lewis Bianamara <stephen.bianam...@gmail.com> wrote:

> > > > > Hello SOLR Community!

> > > > > I have a SOLR cluster which recently hit this error (full error below): "Cannot talk to ZooKeeper - Updates are disabled." I'm running solr 6.6.2 and zookeeper 3.4.6. The first time this happened, we replaced a node within our cluster. The second time, we followed the advice in this post <http://lucene.472066.n3.nabble.com/Cannot-talk-to-ZooKeeper-Updates-are-disabled-Solr-6-3-0-td4311582.html> and just restarted the SOLR service, which resolved the issue. I traced this down (at least the second time) to this message:

> > > > > "WARN (zkCallback-4-thread-31-processing-n:<<IP>>:<<PORT>>_solr) [ ] o.a.s.c.c.ConnectionManager Watcher org.apache.solr.common.cloud.ConnectionManager@4586a480 name: ZooKeeperConnection Watcher:zookeeper-1.dns.domain.foo:1234,zookeeper-2.dns.domain.foo:1234,zookeeper-3.dns.domain.foo:1234 got event WatchedEvent state:Disconnected type:None path:null path: null type: None"

> > > > > I'm wondering a few things. First, can you help me understand what this error means in this context? Did the Zookeepers themselves experience an issue, or just the SOLR node trying to talk to the zookeepers? There was only one SOLR node affected, which was the leader, and thus stopped all writes. Any way to trace this to a specific resource limitation? Our ZK cluster looks to be rather low utilization, but perhaps I'm missing something.
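One cheap check on the "resource limitation" angle, from the Solr side, is to look for long stop-the-world pauses in the JVM GC log around the time of the disconnect. A sketch only: it assumes the stock Solr 6.x start scripts (which enable -XX:+PrintGCApplicationStoppedTime and write to solr_gc.log) and the service-installer log location, so adjust the path to your install:

    # multi-second pauses here can expire the ZooKeeper session and produce the Disconnected event
    grep "Total time for which application threads were stopped" /var/solr/logs/solr_gc.log | tail -50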
> > > > > The second, what steps can I take to make the SOLR-zookeeper interaction more fault tolerant in general? It seems to me like we might want to (a) increase the Zookeeper syncLimit to provide more flexibility within the ZK quorum, but this would only help if the issue was truly on the zk side. We could also increase the tolerance on the SOLR side of things; would this be controlled via the zkClientTimeout? Any other thoughts?

> > > > > The third, is there some more fault tolerant ZK connection string than listing out all three ZK nodes? I *think*, and please correct me if I'm wrong, that this will require all three ZK nodes to be reporting as healthy for the SOLR node to consider the connection healthy. Is that true? Maybe including all three does mean only a 2/3 quorum need be maintained. If the connection health is based on quorum, is moving a busy cluster to 5 nodes for a 3/5 quorum desirable? Any other recommendations to make this healthier?

> > > > > Fourth, is any of the fault tolerance in this area improved in later SOLR/Zookeeper versions?

> > > > > Finally, this looks to be connected to this Jira issue <https://issues.apache.org/jira/browse/SOLR-3274>? The issue doesn't appear to be very actionable unfortunately, but it appears people have wondered about this before. Are there any plans in the works to allow for recovery? We found our ZK cluster was healthy and restarting the solr service fixed the issue, so it seems a reasonable feature to add auto-recovery on the SOLR side when the ZK cluster returns to healthy. Would you agree?

> > > > > Thanks for your help!!
> > > > > Stephen
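For anyone following along, the knobs discussed above live in two files. A minimal sketch using the ensemble hostnames from the WARN message; the numbers are illustrative rather than recommendations, and ZK_CLIENT_TIMEOUT assumes the stock bin/solr scripts, which pass it through as -DzkClientTimeout:

    # zoo.cfg on each ZooKeeper node. syncLimit is measured in ticks:
    # tickTime=2000 with syncLimit=7 gives a follower roughly 14s to sync with the leader.
    tickTime=2000
    initLimit=10
    syncLimit=7
    server.1=zookeeper-1.dns.domain.foo:2888:3888
    server.2=zookeeper-2.dns.domain.foo:2888:3888
    server.3=zookeeper-3.dns.domain.foo:2888:3888

    # solr.in.sh on each Solr node. Listing the whole ensemble means the client
    # stays functional as long as a quorum (2 of 3) is reachable, not all three nodes.
    ZK_HOST="zookeeper-1.dns.domain.foo:1234,zookeeper-2.dns.domain.foo:1234,zookeeper-3.dns.domain.foo:1234"
    ZK_CLIENT_TIMEOUT=30000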