I'm experimenting with dropping the cluster's connections to ZooKeeper and
then reconnecting them. During recovery, I always see this in the leader's
logs:

ElectionContext.java (line 361) Waiting until we see more replicas up for shard shard1: total=2 found=1 timeoutin=139902

and then on the follower, I see:
SolrException.java (line 121) There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:958)
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:922)
        at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1463)
        at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:380)
        at org.apache.solr.cloud.ZkController.access$100(ZkController.java:84)
        at org.apache.solr.cloud.ZkController$1.command(ZkController.java:232)
        at org.apache.solr.common.cloud.ConnectionManager$2$1.run(ConnectionManager.java:179)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/lc4/leaders/shard1
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:273)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:270)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:270)
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:936)
        ... 6 more

They block each other's progress until the leader gives up waiting for
more replicas to come up:

ElectionContext.java (line 368) Was waiting for replicas to come up, but they are taking too long - assuming they won't come back till later

and then recovery moves forward again.

Should waitForLeaderToSeeDownState move on if there's no leader at the
moment?
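
To make the question concrete, here is a rough sketch of the behavior I have
in mind. It is not the actual ZkController code: LeaderLookup, DownStateCheck,
Outcome and the retry parameters are all hypothetical names made up for
illustration, and an empty Optional stands in for the NoNodeException on
/collections/lc4/leaders/shard1. Whether skipping the wait is actually safe in
every case is exactly what I'm unsure about.

```java
import java.util.Optional;

public class LeaderWaitSketch {
    /** Hypothetical outcomes of the wait; not part of Solr. */
    enum Outcome { LEADER_SAW_DOWN, NO_LEADER_SKIPPED, TIMED_OUT }

    /** Hypothetical stand-in for reading the shard's leader node from ZooKeeper. */
    interface LeaderLookup {
        // An empty result plays the role of the NoNodeException in the trace above.
        Optional<String> currentLeaderUrl();
    }

    /** Hypothetical stand-in for asking the leader whether it sees this core as down. */
    interface DownStateCheck {
        boolean leaderSeesDown(String leaderUrl);
    }

    static Outcome waitForLeaderToSeeDownState(LeaderLookup lookup, DownStateCheck check,
                                               int maxAttempts, long sleepMs) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Optional<String> leader = lookup.currentLeaderUrl();
            if (!leader.isPresent()) {
                // Proposed change: no leader is registered, so there is nobody
                // to wait for. Return right away instead of retrying.
                return Outcome.NO_LEADER_SKIPPED;
            }
            if (check.leaderSeesDown(leader.get())) {
                return Outcome.LEADER_SAW_DOWN;
            }
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                return Outcome.TIMED_OUT;
            }
        }
        return Outcome.TIMED_OUT;
    }

    public static void main(String[] args) {
        // No leader registered: the follower moves on without waiting.
        System.out.println(
            waitForLeaderToSeeDownState(() -> Optional.empty(), url -> true, 5, 10));
        // Leader present but it never acknowledges the down state: we time out.
        System.out.println(
            waitForLeaderToSeeDownState(() -> Optional.of("http://host1:8983/solr"),
                                        url -> false, 3, 10));
    }
}
```

With something like this, the follower would finish registering its cores as
down right away, so the leader-elect could see the second replica without
having to wait out its own timeout.
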
Thanks,
Jessica
