"Updates are disabled" means that at least two of your three ZK nodes are unreachable, which is worrisome.
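
A quick way to confirm is to ask each ZK node for its status directly
with the "srvr" four-letter command (a sketch; the hostnames and port
are assumed from the connection string in the log you quote below):

    # Each healthy ensemble member reports "Mode: leader" or
    # "Mode: follower"; fewer than two of three responding means
    # quorum is lost and Solr disables updates.
    for h in zookeeper-1.dns.domain.foo \
             zookeeper-2.dns.domain.foo \
             zookeeper-3.dns.domain.foo; do
      echo "== $h =="
      echo srvr | nc "$h" 1234
    done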
First: That error is coming from Solr, but whether it's a Solr issue
or a ZK issue is ambiguous. It might be explained by the ZK nodes
being under heavy load. Question: is this an external ZK ensemble? If
so, what kind of load are those machines under? If you're using the
embedded ZK, a stop-the-world GC pause could cause this.

Second: Yes, increasing timeouts is one of the tricks (that's
zkClientTimeout on the Solr side; see the P.S. below your quoted
message), but tracking down why the response is so slow is indicated
in either case. I don't have much confidence in that solution here,
though; losing quorum points to something else as the culprit.

Third: Not quite. The whole point of specifying the full ensemble is
that the ZK client is smart enough to keep functioning as long as a
quorum is present, so it is _not_ the case that all the ZK instances
need to be reachable (again, see the P.S. below). On that topic, did
you bounce your ZK servers or change them in any other way? There's a
known issue when you reconfigure live ZK ensembles, see:
https://issues.apache.org/jira/browse/SOLR-12727

Fourth: See above.

HTH,
Erick

On Wed, Dec 12, 2018 at 11:06 PM Stephen Lewis Bianamara
<stephen.bianam...@gmail.com> wrote:
>
> Hello Solr Community!
>
> I have a Solr cluster which recently hit this error (full error
> below): "Cannot talk to ZooKeeper - Updates are disabled." I'm
> running Solr 6.6.2 and ZooKeeper 3.4.6. The first time this happened,
> we replaced a node within our cluster. The second time, we followed
> the advice in this post
> <http://lucene.472066.n3.nabble.com/Cannot-talk-to-ZooKeeper-Updates-are-disabled-Solr-6-3-0-td4311582.html>
> and just restarted the Solr service, which resolved the issue. I
> traced this down (at least the second time) to this message: "WARN
> (zkCallback-4-thread-31-processing-n:<<IP>>:<<PORT>>_solr) [ ]
> o.a.s.c.c.ConnectionManager Watcher
> org.apache.solr.common.cloud.ConnectionManager@4586a480 name:
> ZooKeeperConnection Watcher:zookeeper-1.dns.domain.foo:1234,
> zookeeper-2.dns.domain.foo:1234,zookeeper-3.dns.domain.foo:1234 got
> event WatchedEvent state:Disconnected type:None path:null path: null
> type: None"
>
> I'm wondering a few things. First, can you help me understand what
> this error means in this context? Did the ZooKeepers themselves
> experience an issue, or just the Solr node trying to talk to them?
> Only one Solr node was affected, and it was the leader, so all writes
> stopped. Is there any way to trace this to a specific resource
> limitation? Our ZK cluster looks to be rather low utilization, but
> perhaps I'm missing something.
>
> Second, what steps can I take to make the Solr-ZooKeeper interaction
> more fault tolerant in general? It seems to me like we might want to
> (a) increase the ZooKeeper syncLimit to provide more flexibility
> within the ZK quorum, but this would only help if the issue was truly
> on the ZK side. We could also (b) increase the tolerance on the Solr
> side of things; would this be controlled via zkClientTimeout? Any
> other thoughts?
>
> Third, is there some more fault tolerant ZK connection string than
> listing out all three ZK nodes? I *think*, and please correct me if
> I'm wrong, that this requires all three ZK nodes to be reporting as
> healthy for the Solr node to consider the connection healthy. Is that
> true? Or does including all three mean that only a 2/3 quorum need be
> maintained? If the connection health is based on quorum, is moving a
> busy cluster to 5 nodes for a 3/5 quorum desirable? Any other
> recommendations to make this healthier?
>
> Fourth, is any of the fault tolerance in this area improved in later
> Solr/ZooKeeper versions?
>
> Finally, this looks to be connected to this JIRA issue:
> <https://issues.apache.org/jira/browse/SOLR-3274>. The issue doesn't
> appear to be very actionable, unfortunately, but it does appear that
> people have wondered about this before. Are there any plans in the
> works to allow for recovery? We found our ZK cluster was healthy and
> restarting the Solr service fixed the issue, so auto-recovery on the
> Solr side when the ZK cluster returns to healthy seems like a
> reasonable feature to add. Would you agree?
>
> Thanks for your help!!
> Stephen
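
P.S. Since you asked where the Solr-side knobs live, roughly this (a
sketch; the hostnames/port come from your log, and the timeout value
shown is just the solr.in.sh template default, not a recommendation):

    # solr.in.sh -- list the full ensemble; the client only needs a
    # quorum of these reachable, not all three.
    ZK_HOST="zookeeper-1.dns.domain.foo:1234,zookeeper-2.dns.domain.foo:1234,zookeeper-3.dns.domain.foo:1234"

    # ZK session timeout in ms (maps to zkClientTimeout). Raising this
    # is the "increase timeouts" trick from my Second point.
    ZK_CLIENT_TIMEOUT="15000"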
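
And on the ZK side, syncLimit lives in zoo.cfg (the values below are
the stock sample defaults, shown only for orientation):

    # zoo.cfg
    tickTime=2000   # basic time unit in ms
    initLimit=10    # ticks a follower may take for its initial sync
    syncLimit=5     # ticks a follower may lag the leader before it is
                    # dropped from the quorum

As for 3 vs. 5 nodes: a 3-node ensemble tolerates one failure (quorum
2/3) while a 5-node ensemble tolerates two (quorum 3/5), at the cost
of somewhat slower ZK writes since more nodes must acknowledge each
one.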