[ https://issues.apache.org/jira/browse/SOLR-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jan Høydahl resolved SOLR-8868.
-------------------------------
    Resolution: Done

I'm closing this as I'm pretty sure it is caused by ZOOKEEPER-2184 which is solved by SOLR-12727 in 7.7

> SolrCloud: if zookeeper loses and then regains a quorum, Solr nodes and SolrJ Client do not recover and need to be restarted
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8868
>                 URL: https://issues.apache.org/jira/browse/SOLR-8868
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud, SolrJ
>    Affects Versions: 5.3.1
>            Reporter: Frank J Kelly
>            Priority: Major
>
> Tried mailing list on 3/15 and 3/16 to no avail. Hopefully I gave enough details.
> ----
> Just wondering if my observation of SolrCloud behavior after ZooKeeper loses a quorum is normal or to-be-expected
> Version of Solr: 5.3.1
> Version of ZooKeeper: 3.4.7
> Using SolrCloud with external ZooKeeper
> Deployed on AWS
> Our Solr cluster has 3 nodes (m3.large)
> Our Zookeeper ensemble consists of three nodes (t2.small) with the same config using DNS names e.g.
> {noformat}
> $ more ../conf/zoo.cfg
> tickTime=2000
> dataDir=/var/zookeeper
> dataLogDir=/var/log/zookeeper
> clientPort=2181
> initLimit=10
> syncLimit=5
> standaloneEnabled=false
> server.1=zookeeper1.qa.eu-west-1.mysearch.com:2888:3888
> server.2=zookeeper2.qa.eu-west-1.mysearch.com:2888:3888
> server.3=zookeeper3.qa.eu-west-1.mysearch.com:2888:3888
> {noformat}
> If we terminate one of the zookeeper nodes we get a ZK election (and I think) a quorum is maintained.
> Operation continues OK and we detect the terminated instance and relaunch a new ZK node which comes up fine
> If we terminate two of the ZK nodes we lose a quorum and then we observe the following
> 1.1) Admin UI shows an error that it is unable to contact ZooKeeper “Could not connect to ZooKeeper”
> 1.2) SolrJ returns the following
> {noformat}
> org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_public_index
>     at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
>     at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
>     at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
>     at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
>     at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
>     at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
>     at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
>     at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
>     at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
>     at com.here.scbe.search.solr.SolrFacadeImpl.addToSearchIndex(SolrFacadeImpl.java:112)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_public_index/state.json
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>     at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
>     at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>     at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
>     at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
>     ... 24 more
> {noformat}
> This makes sense based on our understanding.
> When our AutoScale groups launch two new ZooKeeper nodes, initialize them, fix the DNS etc. we regain a quorum but at this point
> 2.1) Admin UI shows the shards as “GONE” (all greyed out)
> 2.2) SolrJ returns the same error even though the ZooKeeper DNS names are now bound to new IP addresses
> So at this point I restart the Solr nodes. At this point then
> 3.1) Admin UI shows the collections as OK (all shards are green) – yeah the nodes are back!
> 3.2) SolrJ Client still shows the same error – namely
> {noformat}
> org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_here_account
>     at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
>     at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
>     at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
>     at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
>     at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
>     at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
>     at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
>     at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
>     at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
>     at com.here.scbe.search.solr.SolrFacadeImpl.deleteById(SolrFacadeImpl.java:257)
> .
> .
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_here_account/state.json
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>     at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
>     at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>     at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
>     at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
> {noformat}
> Is this behavior (lack of self-healing) a known and expected behavior?
> If this is expected behavior then likely this should be recast as an Improvement request?
> Is this the same or similar behavior as documented here
> https://issues.apache.org/jira/browse/SOLR-5129
> p.s. I can add Solr log files if they will help

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
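
The report says a three-node ensemble keeps its quorum after one node is terminated and loses it after two. That follows from a majority rule, which can be sketched in a few lines of Java. This is only an illustration and assumes the default majority quorum. The names QuorumCheck, majority and hasQuorum are hypothetical and are not part of ZooKeeper or Solr.

```java
// Hypothetical helper, not a ZooKeeper or Solr API: models a simple
// majority-quorum rule for an ensemble of voting servers.
final class QuorumCheck {

    /** Smallest number of live voting servers that forms a strict majority. */
    static int majority(int ensembleSize) {
        if (ensembleSize <= 0) {
            throw new IllegalArgumentException("ensembleSize must be positive");
        }
        return ensembleSize / 2 + 1;
    }

    /** True when the live servers are enough to form a majority. */
    static boolean hasQuorum(int ensembleSize, int liveServers) {
        return liveServers >= majority(ensembleSize);
    }

    public static void main(String[] args) {
        // Three voting servers, as in server.1 to server.3 of the reported zoo.cfg.
        int ensembleSize = 3;
        for (int terminated = 0; terminated <= ensembleSize; terminated++) {
            int live = ensembleSize - terminated;
            System.out.printf("terminated=%d live=%d quorum=%b%n",
                    terminated, live, hasQuorum(ensembleSize, live));
        }
    }
}
```

With three servers the majority is two, so one terminated node is survivable and two are not, which matches what the reporter observed before the replacement nodes came up.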