We are running a small SolrCloud instance on AWS:

Solr: Version 5.3.1
ZooKeeper: Version 3.4.6
3 x ZooKeeper nodes (with higher limits and timeouts due to being on AWS)
3 x Solr nodes (8 GB of memory each - 2 collections with 3 shards for each collection)

Let's call the ZooKeeper nodes A, B and C.

One of our ZooKeeper nodes (B) failed a health check and was replaced by autoscaling, but during the failover our SolrCloud cluster became unavailable. New connections to Solr could not be established and complained about connectivity issues, and preexisting connections also had errors.

These errors happened for both queries and adds:

org.apache.solr.common.SolrException: Could not load collection from ZK:qa_us-east-1_here_account
    at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
    at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:943)
    at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:958)
    at com.here.scbe.search.solr.SolrFacadeImpl.querySearchIndex(SolrFacadeImpl.java:183)
    at com.ovi.scbe.search.search.impl.SolrSearcher.searchInner(SolrSearcher.java:69)
    at com.ovi.scbe.search.search.impl.SolrSearcher.search(SolrSearcher.java:56)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
    at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)

org.apache.solr.common.SolrException: Could not load collection from ZK:qa_us-east-1_public_index
    at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
    at com.here.scbe.search.solr.SolrFacadeImpl.addToSearchIndex(SolrFacadeImpl.java:108)
    at com.ovi.scbe.search.index.impl.SolrIndexer.index(SolrIndexer.java:72)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
    at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)

I thought that because we had configured SolrCloud to point at all three ZK nodes, the failure of one ZK node would be OK (since we still had a quorum). Did I misunderstand something about SolrCloud and its relationship with ZK?

The weird thing is that a few minutes after the new ZooKeeper node (D) started up, we could connect to SolrCloud again, even though we were still only pointing to A, B and C (not D). Any thoughts on why this also happened?

Best,
-Frank

Frank Kelly
Principal Software Engineer
Predictive Analytics Team (SCBE/HAC/CDA)
HERE
5 Wayside Rd, Burlington, MA 01803, USA
42° 29' 7" N 71° 11' 32" W
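
P.S. To make the quorum assumption in my question concrete, here is a minimal sketch of the majority arithmetic I had in mind. It is not Solr or ZooKeeper code; the class and method names (QuorumCheck, quorumSize, hasQuorum) are made up for illustration, and it only assumes that a ZooKeeper ensemble needs a strict majority of its servers to be up.

```java
// Sketch of majority-quorum arithmetic only; hypothetical names, not a Solr or ZooKeeper API.
final class QuorumCheck {
    // Smallest number of live servers that forms a strict majority of the ensemble.
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // True if the remaining live servers still form a majority after the given failures.
    static boolean hasQuorum(int ensembleSize, int failedServers) {
        return ensembleSize - failedServers >= quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        int ensemble = 3; // nodes A, B and C
        System.out.println("quorum size: " + quorumSize(ensemble));
        System.out.println("quorum after losing B: " + hasQuorum(ensemble, 1));
    }
}
```

By this arithmetic, losing only B should have left A and C as a majority, which is why the outage surprised me.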