To clarify, when I said "leader" and "follower", I meant the old leader and the old follower from before the ZooKeeper session expired. While they are recovering, there is no leader.

On Tue, Apr 8, 2014 at 1:49 PM, Jessica Mallet <mewmewb...@gmail.com> wrote:

> I'm playing with dropping the cluster's connections to zookeeper and then
> reconnecting them, and during recovery, I always see this on the leader's
> logs:
>
> ElectionContext.java (line 361) Waiting until we see more replicas up for shard shard1: total=2 found=1 timeoutin=139902
>
> and then on the follower, I see:
>
> SolrException.java (line 121) There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:958)
>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:922)
>         at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1463)
>         at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:380)
>         at org.apache.solr.cloud.ZkController.access$100(ZkController.java:84)
>         at org.apache.solr.cloud.ZkController$1.command(ZkController.java:232)
>         at org.apache.solr.common.cloud.ConnectionManager$2$1.run(ConnectionManager.java:179)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/lc4/leaders/shard1
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:273)
>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:270)
>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
>         at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:270)
>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:936)
>         ... 6 more
>
> They block each other's progress until leader decides to give up and not
> wait for more replicas to come up:
>
> ElectionContext.java (line 368) Was waiting for replicas to come up, but they are taking too long - assuming they won't come back till later
>
> and then recovery moves forward again.
>
> Should waitForLeaderToSeeDownState move on if there's no leader at the
> moment?
> Thanks,
> Jessica
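
To make the question in the quoted message concrete, here is a minimal, self-contained Java sketch of the behavior being asked about. It is not Solr's code and does not reflect how ZkController is actually written: LeaderWaitSketch, waitForLeader, Outcome, and the moveOnWhenNoLeader flag are all hypothetical names, and the Supplier is only a stand-in for the ZooKeeper read of the leader node (an empty result models the NoNode case in the trace above).

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch, not Solr's implementation.
class LeaderWaitSketch {

    // Possible results of the wait.
    enum Outcome { LEADER_FOUND, NO_LEADER_MOVED_ON, TIMED_OUT }

    /**
     * Polls leaderLookup up to maxAttempts times.
     * leaderLookup stands in for reading the leader node from ZooKeeper;
     * Optional.empty() models the node being absent (no leader elected yet).
     * If moveOnWhenNoLeader is true, an absent leader ends the wait at once
     * instead of retrying until the attempts run out.
     */
    static Outcome waitForLeader(Supplier<Optional<String>> leaderLookup,
                                 int maxAttempts,
                                 long sleepMillis,
                                 boolean moveOnWhenNoLeader) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Optional<String> leader = leaderLookup.get();
            if (leader.isPresent()) {
                return Outcome.LEADER_FOUND;
            }
            if (moveOnWhenNoLeader) {
                // Nobody to wait for, so let recovery continue.
                return Outcome.NO_LEADER_MOVED_ON;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                return Outcome.TIMED_OUT;
            }
        }
        return Outcome.TIMED_OUT;
    }

    public static void main(String[] args) {
        Supplier<Optional<String>> noLeader = Optional::empty;

        // Retries every attempt before giving up.
        System.out.println(waitForLeader(noLeader, 3, 10L, false));

        // Returns on the first lookup because no leader exists.
        System.out.println(waitForLeader(noLeader, 3, 10L, true));
    }
}
```

With the flag off, the follower side burns through all of its attempts while the old leader is itself waiting for replicas, which is the mutual wait described above. With the flag on, it returns on the first lookup. Whether moving on is actually safe for Solr's recovery logic is exactly the open question.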