Re: RETRY: SolrCloud does not recover after ZooKeeper ensemble loses (and then regains) a quorum

Kelly, Frank Sat, 19 Mar 2016 07:36:45 -0700

Thanks for taking a look.
I’m not sure https://issues.apache.org/jira/browse/SOLR-8326 is a match, as
we aren’t using PKIAuthPlugin.


-Frank

Frank Kelly
Principal Software Engineer
Predictive Analytics Team (SCBE/HAC/CDA)

HERE 
5 Wayside Rd, Burlington, MA 01803, USA
42° 29' 7" N 71° 11' 32" W


On 3/18/16, 1:49 PM, "Oakley, Craig (NIH/NLM/NCBI) [C]"
<craig.oak...@nih.gov> wrote:

>I am wondering whether this might be the bug of SOLR-8326, which is fixed
>in Solr 5.4
>
>That's my guess as a user who ran into the bug myself.
>
>-----Original Message-----
>From: Kelly, Frank [mailto:frank.ke...@here.com]
>Sent: Wednesday, March 16, 2016 3:09 PM
>To: solr-user@lucene.apache.org
>Subject: Re: RETRY: SolrCloud does not recover after ZooKeeper ensemble
>loses (and then regains) a quorum
>
>Any thoughts on this?
>
>Hoping for just a quick answer, either
>1) Yes - once ZooKeeper loses a quorum you need to restart Solr and your
>SolrJ Client
>2) No - that's not expected behavior - Solr and SolrJ should recover -
>please file a JIRA issue
>
>Cheers!
>
>On 3/16/16, 8:54 AM, "Kelly, Frank" <frank.ke...@here.com> wrote:
>
>><This time without images :-) >
>>
>>Just wondering if my observation of SolrCloud behavior after ZooKeeper
>>loses a quorum is normal or to-be-expected
>>
>>Version of Solr: 5.3.1
>>Version of ZooKeeper: 3.4.7
>>Using SolrCloud with external ZooKeeper
>>Deployed on AWS
>>
>>Our Solr cluster has 3 nodes
>>
>>Our Zookeeper ensemble consists of three nodes with the same config using
>>DNS names e.g.
>>
>>$ more ../conf/zoo.cfg
>>tickTime=2000
>>dataDir=/var/zookeeper
>>dataLogDir=/var/log/zookeeper
>>clientPort=2181
>>initLimit=10
>>syncLimit=5
>>standaloneEnabled=false
>>server.1=zookeeper1.qa.eu-west-1.mysearch.com:2888:3888
>>server.2=zookeeper2.qa.eu-west-1.mysearch.com:2888:3888
>>server.3=zookeeper3.qa.eu-west-1.mysearch.com:2888:3888
>>
>>If we terminate one of the ZooKeeper nodes we get a ZK election and (I
>>think) a quorum is maintained.
>>Operation continues OK; we detect the terminated instance and relaunch
>>a new ZK node, which comes up fine.
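
For reference, the quorum arithmetic behind the one-node and two-node cases can be sketched as below. This is a minimal illustration, assuming ZooKeeper's default majority quorum; the class and method names (`QuorumMath`, `quorumSize`, `tolerableFailures`, `hasQuorum`) are made up for this example and are not part of ZooKeeper or Solr.

```java
// Hypothetical helper, for illustration only (not a ZooKeeper or Solr API).
final class QuorumMath {
    private QuorumMath() {}

    /** Smallest number of voting servers that forms a strict majority. */
    static int quorumSize(int ensembleSize) {
        if (ensembleSize < 1) {
            throw new IllegalArgumentException("ensembleSize must be >= 1");
        }
        return ensembleSize / 2 + 1;
    }

    /** Number of servers that can be lost while a majority still remains. */
    static int tolerableFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    /** True if the live servers still form a majority of the ensemble. */
    static boolean hasQuorum(int ensembleSize, int liveServers) {
        return liveServers >= quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        int n = 3; // ensemble size from the zoo.cfg above
        System.out.println("quorum size        = " + quorumSize(n));
        System.out.println("tolerable failures = " + tolerableFailures(n));
        System.out.println("2 of 3 alive       -> " + hasQuorum(n, 2));
        System.out.println("1 of 3 alive       -> " + hasQuorum(n, 1));
    }
}
```

With three servers a majority is two, so the ensemble keeps a quorum after losing one server but not after losing two, which is consistent with the behavior described here.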
>>
>>If we terminate two of the ZK nodes we lose a quorum and then we observe
>>the following
>>
>>1.1) Admin UI shows an error that it is unable to contact ZooKeeper
>>"Could not connect to ZooKeeper"
>>
>>1.2) SolrJ returns the following
>>
>>org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_public_index
>>at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
>>at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
>>at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
>>at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
>>at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
>>at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
>>at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
>>at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
>>at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
>>at com.here.scbe.search.solr.SolrFacadeImpl.addToSearchIndex(SolrFacadeImpl.java:112)
>>Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_public_index/state.json
>>at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>>at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
>>at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
>>at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>>at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
>>at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
>>... 24 more
>>
>>This makes sense based on our understanding.
>>When our AutoScale groups launch two new ZooKeeper nodes, initialize
>>them, fix the DNS etc. we regain a quorum but at this point
>>
>>2.1) Admin UI shows the shards as "GONE" (all greyed out)
>>
>>2.2) SolrJ returns the same error even though the ZooKeeper DNS names are
>>now bound to new IP addresses
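
Since the DNS names were rebound to new IP addresses, one thing that may be worth checking is what the client JVM itself currently resolves those names to. The sketch below uses only standard JDK calls (`InetAddress.getAllByName` and the `networkaddress.cache.ttl` security property, which governs how long the JVM caches successful lookups). The class name `DnsCheck` and its `resolve` helper are made up for this example; the sketch does not establish the cause of the error, it only shows what the JVM sees at the moment it runs.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, for illustration only.
final class DnsCheck {
    private DnsCheck() {}

    /** Returns the addresses this JVM resolves for host, or an empty list if unknown. */
    static List<String> resolve(String host) {
        List<String> addresses = new ArrayList<>();
        try {
            for (InetAddress address : InetAddress.getAllByName(host)) {
                addresses.add(address.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // Name could not be resolved; report that as an empty list.
        }
        return addresses;
    }

    public static void main(String[] args) {
        // Prints null when the property is not set explicitly.
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));
        // Pass the ZooKeeper host names from zoo.cfg as arguments.
        for (String host : args) {
            System.out.println(host + " -> " + resolve(host));
        }
    }
}
```

If a fresh JVM already reports the new addresses while the running client keeps failing, the stale address might be held inside the already-constructed client rather than in DNS; that is an assumption to verify, not something this thread confirms.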
>>
>>So at this point I restart the Solr nodes. After the restart:
>>
>>3.1) Admin UI shows the collections as OK (all shards are green) - yeah,
>>the nodes are back!
>>
>>3.2) SolrJ Client still shows the same error, namely
>>
>>org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_here_account
>>at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
>>at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
>>at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
>>at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
>>at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
>>at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
>>at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
>>at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
>>at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
>>at com.here.scbe.search.solr.SolrFacadeImpl.deleteById(SolrFacadeImpl.java:257)
>>.
>>.
>>Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_here_account/state.json
>>at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>>at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
>>at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
>>at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>>at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
>>at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
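
Because the long-lived client keeps failing here, one possible stopgap (short of restarting the whole application) is to discard the client object and build a fresh one when a call fails. The sketch below shows that pattern with JDK types only. `RebuildOnFailure`, `Op`, and `call` are hypothetical names invented for this example; in a real application the factory would presumably construct and return a new SolrJ client, but whether rebuilding actually clears this particular error is an assumption that has not been verified in this thread.

```java
import java.util.function.Supplier;

// Hypothetical wrapper, for illustration only: rebuilds a closeable client after a failed call.
final class RebuildOnFailure<C extends AutoCloseable> {

    /** An operation against the client that may throw. */
    interface Op<T, R> {
        R apply(T client) throws Exception;
    }

    private final Supplier<C> factory;
    private final int maxAttempts;
    private C client;

    RebuildOnFailure(Supplier<C> factory, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.factory = factory;
        this.maxAttempts = maxAttempts;
        this.client = factory.get();
    }

    /** Runs op; on any exception, closes the client, builds a new one, and retries. */
    synchronized <R> R call(Op<C, R> op) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.apply(client);
            } catch (Exception e) {
                last = e;
                closeQuietly(client);
                client = factory.get(); // fresh client for the next attempt or the next call
            }
        }
        throw new IllegalStateException("all " + maxAttempts + " attempts failed", last);
    }

    private static void closeQuietly(AutoCloseable closeable) {
        try {
            closeable.close();
        } catch (Exception ignored) {
            // The old client is being discarded anyway.
        }
    }
}
```

This version rebuilds on every exception and retries immediately, which is deliberately coarse; a production variant would likely rebuild only on connection-type failures and wait between attempts.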
>>
>>I have a few questions
>>1) Is this behavior (lack of self-healing) a known behavior?
>>2) Is this the same or similar behavior as documented here
>>https://issues.apache.org/jira/browse/SOLR-5129
>>3) If it is not covered by #2 should I log it in JIRA?
>>
>>Thanks and Best Wishes,
>>
>>-Frank
>>
>>p.s. I can add Solr log files if they will help
>>
>
