[jira] [Resolved] (SOLR-8868) SolrCloud: if zookeeper loses and then regains a quorum, Solr nodes and SolrJ Client do not recover and need to be restarted

Jira Fri, 03 Apr 2020 15:50:25 -0700


     [ https://issues.apache.org/jira/browse/SOLR-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jan Høydahl resolved SOLR-8868.
-------------------------------
    Resolution: Done

I'm closing this because I'm fairly sure it is caused by ZOOKEEPER-2184, which 
was resolved by SOLR-12727 in 7.7.
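
For readers without the background: as far as I understand it, ZOOKEEPER-2184 
is about the ZooKeeper client not re-resolving ensemble host names after a 
connection failure, so it keeps retrying IP addresses that no longer exist. 
The sketch below only illustrates that idea in plain Java. It is not the 
actual ZooKeeper or Solr fix; the class name ReResolvingEndpoint and the use 
of "localhost" with port 2181 are assumptions made for the example.

```java
import java.net.InetSocketAddress;

/** Hypothetical helper: keeps the host NAME and looks it up again on demand. */
final class ReResolvingEndpoint {
    private final String host;
    private final int port;

    ReResolvingEndpoint(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /**
     * Builds a new InetSocketAddress on every call. The constructor attempts
     * a lookup each time; if it fails, the result is flagged as unresolved.
     */
    InetSocketAddress resolveNow() {
        return new InetSocketAddress(host, port);
    }

    public static void main(String[] args) {
        // Resolved once: this object keeps whatever IP it saw at creation time,
        // even if the DNS record is later pointed at a replacement server.
        InetSocketAddress resolvedOnce = new InetSocketAddress("localhost", 2181);

        // Re-resolving: each reconnect attempt asks for the address again.
        ReResolvingEndpoint endpoint = new ReResolvingEndpoint("localhost", 2181);
        InetSocketAddress fresh = endpoint.resolveNow();

        System.out.println("cached at startup: " + resolvedOnce);
        System.out.println("resolved just now: " + fresh);
        System.out.println("lookup succeeded:  " + !fresh.isUnresolved());
    }
}
```

Note that the JVM keeps its own DNS cache (controlled by the 
networkaddress.cache.ttl security property), so a fresh lookup in application 
code can still return an old address until that cache entry expires.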

> SolrCloud: if zookeeper loses and then regains a quorum, Solr nodes and SolrJ 
> Client do not recover and need to be restarted
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8868
>                 URL: https://issues.apache.org/jira/browse/SOLR-8868
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud, SolrJ
>    Affects Versions: 5.3.1
>            Reporter: Frank J Kelly
>            Priority: Major
>
> Tried mailing list on 3/15 and 3/16 to no avail. Hopefully I gave enough 
> details.
> ----
> I am wondering whether the SolrCloud behavior I observe after ZooKeeper 
> loses a quorum is normal or to be expected.
> Version of Solr: 5.3.1
> Version of ZooKeeper: 3.4.7
> Using SolrCloud with external ZooKeeper
> Deployed on AWS
> Our Solr cluster has 3 nodes (m3.large)
> Our Zookeeper ensemble consists of three nodes (t2.small) with the same 
> config using DNS names e.g.
> {noformat}
> $ more ../conf/zoo.cfg
> tickTime=2000
> dataDir=/var/zookeeper
> dataLogDir=/var/log/zookeeper
> clientPort=2181
> initLimit=10
> syncLimit=5
> standaloneEnabled=false
> server.1=zookeeper1.qa.eu-west-1.mysearch.com:2888:3888
> server.2=zookeeper2.qa.eu-west-1.mysearch.com:2888:3888
> server.3=zookeeper3.qa.eu-west-1.mysearch.com:2888:3888
> {noformat}
> If we terminate one of the ZooKeeper nodes we get a ZK election and (I think) 
> a quorum is maintained.
> Operation continues OK; we detect the terminated instance and relaunch a 
> new ZK node, which comes up fine.
> If we terminate two of the ZK nodes we lose a quorum and then observe the 
> following:
> 1.1) Admin UI shows an error that it is unable to contact ZooKeeper: "Could 
> not connect to ZooKeeper"
> 1.2) SolrJ returns the following
> {noformat}
> org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_public_index
> at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
> at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
> at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
> at com.here.scbe.search.solr.SolrFacadeImpl.addToSearchIndex(SolrFacadeImpl.java:112)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_public_index/state.json
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
> at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
> at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
> at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
> at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
> at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
> ... 24 more
> {noformat}
> This makes sense based on our understanding.
> When our AutoScale groups launch two new ZooKeeper nodes, initialize them, 
> fix the DNS, etc., we regain a quorum, but at this point:
> 2.1) Admin UI shows the shards as “GONE” (all greyed out)
> 2.2) SolrJ returns the same error even though the ZooKeeper DNS names are now 
> bound to new IP addresses
> So I restart the Solr nodes. After the restart:
> 3.1) Admin UI shows the collections as OK (all shards are green) – yeah the 
> nodes are back!
> 3.2) SolrJ Client still shows the same error – namely
> {noformat}
> org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_here_account
> at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
> at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
> at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
> at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
> at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
> at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
> at com.here.scbe.search.solr.SolrFacadeImpl.deleteById(SolrFacadeImpl.java:257)
> .
> .
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_here_account/state.json
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
> at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
> at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
> at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
> at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
> at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
> {noformat}
> Is this lack of self-healing known and expected behavior?
> If it is expected, should this be recast as an Improvement request?
> Is this the same as, or similar to, the behavior documented in 
> https://issues.apache.org/jira/browse/SOLR-5129 ?
> P.S. I can add Solr log files if they will help.
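
The quorum behavior described in the report (losing one of three ZooKeeper 
nodes is tolerated, losing two is not) follows from majority arithmetic: an 
ensemble has a quorum only while more than half of its configured servers are 
up. A minimal sketch of that arithmetic, assuming ZooKeeper's default 
majority quorum; the class and method names are made up for illustration.

```java
/** Hypothetical helper for majority-quorum arithmetic. */
final class QuorumMath {
    /** Smallest number of live servers that is a strict majority. */
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** How many servers can be down while a majority is still up. */
    static int tolerableFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    /** True if the given number of live servers forms a quorum. */
    static boolean hasQuorum(int ensembleSize, int liveServers) {
        return liveServers >= quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        int ensemble = 3; // server.1 to server.3 in the reported zoo.cfg

        System.out.println("quorum size:        " + quorumSize(ensemble));        // 2
        System.out.println("tolerable failures: " + tolerableFailures(ensemble)); // 1
        System.out.println("one node down:      " + hasQuorum(ensemble, 2));      // true
        System.out.println("two nodes down:     " + hasQuorum(ensemble, 1));      // false
    }
}
```

With three servers the quorum size is 2, so terminating one node leaves a 
working ensemble while terminating two does not, which matches the 
observations in steps 1.1 and 1.2.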



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
