Hello Solr Community!

I have a Solr cluster which recently hit this error (full error below): "Cannot talk to ZooKeeper - Updates are disabled." I'm running Solr 6.6.2 and ZooKeeper 3.4.6.

The first time this happened, we replaced a node within our cluster. The second time, we followed the advice in this post <http://lucene.472066.n3.nabble.com/Cannot-talk-to-ZooKeeper-Updates-are-disabled-Solr-6-3-0-td4311582.html> and just restarted the Solr service, which resolved the issue. I traced this down (at least the second time) to this message:

"WARN (zkCallback-4-thread-31-processing-n:<<IP>>:<<PORT>>_solr) [ ] o.a.s.c.c.ConnectionManager Watcher org.apache.solr.common.cloud.ConnectionManager@4586a480 name: ZooKeeperConnection Watcher:zookeeper-1.dns.domain.foo:1234,zookeeper-2.dns.domain.foo:1234,zookeeper-3.dns.domain.foo:1234 got event WatchedEvent state:Disconnected type:None path:null path: null type: None"
I'm wondering a few things.

First, can you help me understand what this error means in this context? Did the ZooKeeper nodes themselves experience an issue, or was it just the Solr node having trouble talking to them? Only one Solr node was affected, but it was the leader, so all writes stopped. Is there any way to trace this to a specific resource limitation? Our ZK cluster looks to have rather low utilization, but perhaps I'm missing something.

Second, what steps can I take to make the Solr-ZooKeeper interaction more fault tolerant in general? It seems to me like we might want to (a) increase the ZooKeeper syncLimit to provide more flexibility within the ZK quorum, although this would only help if the issue was truly on the ZK side, and (b) increase the tolerance on the Solr side of things. Would (b) be controlled via zkClientTimeout? Any other thoughts?

Third, is there a more fault-tolerant ZK connection string than listing out all three ZK nodes? I *think*, and please correct me if I'm wrong, that this requires all three ZK nodes to be reporting as healthy for the Solr node to consider the connection healthy. Is that true? Or does including all three mean that only a 2/3 quorum needs to be maintained? If connection health is based on quorum, is moving a busy cluster to 5 nodes for a 3/5 quorum desirable? Any other recommendations to make this healthier?

Fourth, is any of the fault tolerance in this area improved in later Solr/ZooKeeper versions?

Finally, is this connected to this Jira issue <https://issues.apache.org/jira/browse/SOLR-3274>? Unfortunately the issue doesn't appear to be very actionable, but it shows that people have wondered about this before. Are there any plans in the works to allow for recovery? We found that our ZK cluster was healthy and that restarting the Solr service fixed the issue, so auto-recovery on the Solr side once the ZK cluster returns to healthy seems like a reasonable feature to add. Would you agree?

Thanks for your help!!

Stephen
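
P.S. To make the quorum part of my question concrete, here is the arithmetic I'm assuming, as a minimal Python sketch. It only encodes the strict-majority rule for a ZooKeeper ensemble (more than half of the servers must be up). The function names are mine, purely for illustration, and whether Solr's view of "connection healthy" actually follows this rule is exactly what I'm asking.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble with `ensemble_size` servers."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a strict majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Compare the current 3-node ensemble with the 5-node one I'm considering.
    for n in (3, 5):
        print(f"{n} nodes: quorum={quorum_size(n)}, "
              f"can lose={tolerated_failures(n)}")
```

If that assumption holds, going from 3 to 5 nodes would raise the number of ZK servers we can lose from one to two, at the cost of a larger ensemble to keep in sync.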