Any thoughts on this? I'm hoping for just a quick answer, one of:

1) Yes - once ZooKeeper loses a quorum you need to restart Solr and your SolrJ client
2) No - that's not expected behavior - Solr and SolrJ should recover - please file a JIRA issue
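If the answer turns out to be 1), a client-side workaround could look roughly like the sketch below. It is only an illustration under stated assumptions: `RebuildingClient`, `Op`, and `maxFailures` are hypothetical names, not SolrJ API, and the sketch assumes that discarding the client and building a new one is enough to pick up the new ZooKeeper addresses.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of SolrJ): wraps a closeable client and
 * replaces it with a fresh instance after repeated consecutive failures.
 */
final class RebuildingClient<C extends AutoCloseable> {

    /** An operation against the client that may throw checked exceptions. */
    interface Op<T, R> {
        R apply(T client) throws Exception;
    }

    private final Supplier<C> factory;   // builds a new client (assumed cheap enough to call on failure)
    private final int maxFailures;       // consecutive failures tolerated before rebuilding
    private C client;
    private int consecutiveFailures;

    RebuildingClient(Supplier<C> factory, int maxFailures) {
        this.factory = factory;
        this.maxFailures = maxFailures;
        this.client = factory.get();
    }

    /** Runs the operation; on the maxFailures-th consecutive failure, swaps in a new client. */
    synchronized <R> R call(Op<C, R> op) throws Exception {
        try {
            R result = op.apply(client);
            consecutiveFailures = 0;
            return result;
        } catch (Exception e) {
            consecutiveFailures++;
            if (consecutiveFailures >= maxFailures) {
                try {
                    client.close();
                } catch (Exception ignored) {
                    // best effort: the old client is being discarded anyway
                }
                client = factory.get();
                consecutiveFailures = 0;
            }
            throw e; // the caller still sees the original failure and can retry
        }
    }
}
```

With SolrJ the factory would presumably be something like `() -> new CloudSolrClient(zkHost)` and the operation `c -> c.add(doc)`. Whether a freshly built client really reconnects after the ensemble comes back on new IP addresses is not verified here.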
Cheers!

Frank Kelly
Principal Software Engineer
Predictive Analytics Team (SCBE/HAC/CDA)

HERE
5 Wayside Rd, Burlington, MA 01803, USA
42° 29' 7" N 71° 11' 32" W

<http://360.here.com/>
<https://twitter.com/here>
<https://www.facebook.com/here>
<https://linkedin.com/company/heremaps>
<https://www.instagram.com/here>

On 3/16/16, 8:54 AM, "Kelly, Frank" <frank.ke...@here.com> wrote:

><This time without images :-)
>
>Just wondering if my observation of SolrCloud behavior after ZooKeeper
>loses a quorum is normal or to-be-expected
>
>Version of Solr: 5.3.1
>Version of ZooKeeper: 3.4.7
>Using SolrCloud with external ZooKeeper
>Deployed on AWS
>
>Our Solr cluster has 3 nodes
>
>Our Zookeeper ensemble consists of three nodes with the same config using
>DNS names e.g.
>
>$ more ../conf/zoo.cfg
>tickTime=2000
>dataDir=/var/zookeeper
>dataLogDir=/var/log/zookeeper
>clientPort=2181
>initLimit=10
>syncLimit=5
>standaloneEnabled=false
>server.1=zookeeper1.qa.eu-west-1.mysearch.com:2888:3888
>server.2=zookeeper2.qa.eu-west-1.mysearch.com:2888:3888
>server.3=zookeeper3.qa.eu-west-1.mysearch.com:2888:3888
>
>If we terminate one of the zookeeper nodes we get a ZK election (and I
>think) a quorum is maintained.
>Operation continues OK and we detect the terminated instance and relaunch
>a new ZK node which comes up fine
>
>If we terminate two of the ZK nodes we lose a quorum and then we observe
>the following
>
>1.1) Admin UI shows an error that it is unable to contact ZooKeeper
>"Could not connect to ZooKeeper"
>
>1.2) SolrJ returns the following
>
>org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_public_index
>at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
>at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
>at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
>at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
>at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
>at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
>at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
>at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
>at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
>at com.here.scbe.search.solr.SolrFacadeImpl.addToSearchIndex(SolrFacadeImpl.java:112)
>Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_public_index/state.json
>at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
>at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
>at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
>at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
>... 24 more
>
>This makes sense based on our understanding.
>When our AutoScale groups launch two new ZooKeeper nodes, initialize
>them, fix the DNS etc. we regain a quorum but at this point
>
>2.1) Admin UI shows the shards as "GONE" (all greyed out)
>
>2.2) SolrJ returns the same error even though the ZooKeeper DNS names are
>now bound to new IP addresses
>
>So at this point I restart the Solr nodes. At this point then
>
>3.1) Admin UI shows the collections as OK (all shards are green) yeah
>the nodes are back!
>
>3.2) SolrJ Client still shows the same error namely
>
>org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_here_account
>at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
>at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
>at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
>at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
>at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
>at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
>at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
>at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
>at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
>at com.here.scbe.search.solr.SolrFacadeImpl.deleteById(SolrFacadeImpl.java:257)
>.
>.
>Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_here_account/state.json
>at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
>at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
>at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
>at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
>
>I have a few questions
>1) Is this behavior (lack of self-healing) a known behavior?
>2) Is this the same or similar behavior as documented here
>https://issues.apache.org/jira/browse/SOLR-5129
>3) If it is not covered by #2 should I log it in JIRA?
>
>Thanks and Best Wishes,
>
>-Frank
>
>p.s. I can add Solr log files if they will help
>
>
>Frank Kelly
>Principal Software Engineer
>Predictive Analytics Team (SCBE/HAC/CDA)
>
>HERE
>5 Wayside Rd, Burlington, MA 01803, USA
>42° 29' 7" N 71° 11' 32" W
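
For reference, the one-node and two-node termination cases in the quoted message follow from ZooKeeper's default majority-quorum rule: an ensemble of n servers needs more than n/2 of them alive. A minimal sketch of that arithmetic, with the class and method names (`QuorumMath`, `quorumSize`, `hasQuorum`) made up purely for illustration:

```java
/** Illustrative only: majority-quorum arithmetic for an ensemble of a given size. */
final class QuorumMath {

    /** Smallest strict majority of the ensemble, e.g. 2 of 3 or 3 of 5. */
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** True if the surviving nodes still form a majority of the configured ensemble. */
    static boolean hasQuorum(int ensembleSize, int liveNodes) {
        return liveNodes >= quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        int ensemble = 3; // matches the three server.N entries in the quoted zoo.cfg
        for (int terminated = 0; terminated <= ensemble; terminated++) {
            int live = ensemble - terminated;
            System.out.printf("terminated=%d live=%d hasQuorum=%b%n",
                    terminated, live, hasQuorum(ensemble, live));
        }
    }
}
```

For the three-node ensemble above, losing one node leaves two of three (still a majority), while losing two leaves one of three, which is consistent with the behavior described in the quoted message.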