Thanks for taking a look. I'm not sure https://issues.apache.org/jira/browse/SOLR-8326 is a match, as we aren't using PKIAuthPlugin.
-Frank

Frank Kelly
Principal Software Engineer
Predictive Analytics Team (SCBE/HAC/CDA)

HERE
5 Wayside Rd, Burlington, MA 01803, USA
42° 29' 7" N 71° 11' 32" W

<http://360.here.com/> <https://twitter.com/here> <https://www.facebook.com/here> <https://linkedin.com/company/heremaps> <https://www.instagram.com/here>

On 3/18/16, 1:49 PM, "Oakley, Craig (NIH/NLM/NCBI) [C]" <craig.oak...@nih.gov> wrote:

>I am wondering whether this might be the bug of SOLR-8326, which is fixed
>in Solr 5.4
>
>That's my guess as a user who ran into the bug myself.
>
>-----Original Message-----
>From: Kelly, Frank [mailto:frank.ke...@here.com]
>Sent: Wednesday, March 16, 2016 3:09 PM
>To: solr-user@lucene.apache.org
>Subject: Re: RETRY: SolrCloud does not recover after ZooKeeper ensemble
>loses (and then regains) a quorum
>
>Any thoughts on this?
>
>Hoping for just a quick
>1) Yes - once ZooKeeper loses a Quorum you need to restart Solr and your
>SolrJ Client
>2) No - that's not expected behavior - Solr and SolrJ should recover -
>please file a JIRA issue
>
>Cheers!
>
>Frank Kelly
>
>On 3/16/16, 8:54 AM, "Kelly, Frank" <frank.ke...@here.com> wrote:
>
>><This time without images :-) >
>>
>>Just wondering if my observation of SolrCloud behavior after ZooKeeper
>>loses a quorum is normal or to-be-expected
>>
>>Version of Solr: 5.3.1
>>Version of ZooKeeper: 3.4.7
>>Using SolrCloud with external ZooKeeper
>>Deployed on AWS
>>
>>Our Solr cluster has 3 nodes
>>
>>Our Zookeeper ensemble consists of three nodes with the same config using
>>DNS names e.g.
>>
>>$ more ../conf/zoo.cfg
>>tickTime=2000
>>dataDir=/var/zookeeper
>>dataLogDir=/var/log/zookeeper
>>clientPort=2181
>>initLimit=10
>>syncLimit=5
>>standaloneEnabled=false
>>server.1=zookeeper1.qa.eu-west-1.mysearch.com:2888:3888
>>server.2=zookeeper2.qa.eu-west-1.mysearch.com:2888:3888
>>server.3=zookeeper3.qa.eu-west-1.mysearch.com:2888:3888
>>
>>If we terminate one of the zookeeper nodes we get a ZK election (and I
>>think) a quorum is maintained.
>>Operation continues OK and we detect the terminated instance and relaunch
>>a new ZK node which comes up fine
>>
>>If we terminate two of the ZK nodes we lose a quorum and then we observe
>>the following
>>
>>1.1) Admin UI shows an error that it is unable to contact ZooKeeper
>>"Could not connect to ZooKeeper"
>>
>>1.2) SolrJ returns the following
>>
>>org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_public_index
>>at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
>>at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
>>at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
>>at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
>>at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
>>at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
>>at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
>>at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
>>at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
>>at com.here.scbe.search.solr.SolrFacadeImpl.addToSearchIndex(SolrFacadeImpl.java:112)
>>Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_public_index/state.json
>>at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>>at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
>>at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
>>at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>>at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
>>at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
>>... 24 more
>>
>>This makes sense based on our understanding.
>>When our AutoScale groups launch two new ZooKeeper nodes, initialize
>>them, fix the DNS etc. we regain a quorum but at this point
>>
>>2.1) Admin UI shows the shards as "GONE" (all greyed out)
>>
>>2.2) SolrJ returns the same error even though the ZooKeeper DNS names are
>>now bound to new IP addresses
>>
>>So at this point I restart the Solr nodes. At this point then
>>
>>3.1) Admin UI shows the collections as OK (all shards are green) yeah
>>the nodes are back!
>>
>>3.2) SolrJ Client still shows the same error namely
>>
>>org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_here_account
>>at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
>>at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
>>at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
>>at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
>>at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
>>at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
>>at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
>>at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
>>at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
>>at com.here.scbe.search.solr.SolrFacadeImpl.deleteById(SolrFacadeImpl.java:257)
>>.
>>.
>>Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_here_account/state.json
>>at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>>at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
>>at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
>>at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>>at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
>>at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
>>
>>I have a few questions
>>1) Is this behavior (lack of self-healing) a known behavior?
>>2) Is this the same or similar behavior as documented here
>>https://issues.apache.org/jira/browse/SOLR-5129
>>3) If it is not covered by #2 should I log it in JIRA?
>>
>>Thanks and Best Wishes,
>>
>>-Frank
>>
>>p.s. I can add Solr log files if they will help