Hi Mark,

I'm using "roughly" the 4x branch up to this commit: https://github.com/apache/lucene-solr/tree/25aaf44221e3a3be8fab1ba22f16b13f5df6c64c. Could you please point me to the JIRA issue that addressed this? I couldn't find it by searching for "waitForLeaderToSeeDownState".
Thanks,
Jessica

On Wed, Apr 16, 2014 at 4:35 PM, Mark Miller <markrmil...@gmail.com> wrote:

> What version are you testing? Thought we had addressed this.
> --
> Mark Miller
> about.me/markrmiller
>
> On April 16, 2014 at 6:02:09 PM, Jessica Mallet (mewmewb...@gmail.com) wrote:
>
> Hi Furkan,
>
> Thanks for the reply. I understand the intent. However, in the case I
> described, the follower is blocked on looking for a leader (throws the
> pasted exception because it can't find the leader) before it participates
> in election; therefore, it will never come up while the leader waits for it
> to come up (they're deadlocked waiting for each other). What I'm suggesting
> is that maybe the follower should just skip waitForLeaderToSeeDownState
> when there's no leader (instead of failing with the pasted stacktrace) and
> go ahead and start participating in election. That way the leader will see
> more replicas come up, and they can sync with each other and move on.
>
> Thanks,
> Jessica
>
> On Sat, Apr 12, 2014 at 4:14 PM, Furkan KAMACI <furkankam...@gmail.com> wrote:
>
> > Hi;
> >
> > There is an explanation as follows: "This is meant to protect the case
> > where you stop a shard or it fails and then the first node to get started
> > back up has stale data - you don't want it to just become the leader. So we
> > wait to see everyone we know about in the shard up to 3 or 5 min by
> > default. Then we know all the shards participate in the leader election and
> > the leader will end up with all updates it should have." You can check it
> > from here:
> >
> > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201306.mbox/%3ccajt9wng_yykcxggentgcxguhhcjhidear-jygpgrnkaedrz...@mail.gmail.com%3E
> >
> > Thanks;
> > Furkan KAMACI
> >
> > 2014-04-08 23:51 GMT+03:00 Jessica Mallet <mewmewb...@gmail.com>:
> >
> > > To clarify, when I said "leader" and "follower" I meant the old leader and
> > > follower before the zookeeper session expiration. When they're recovering
> > > there's no leader.
> > >
> > > On Tue, Apr 8, 2014 at 1:49 PM, Jessica Mallet <mewmewb...@gmail.com> wrote:
> > >
> > > > I'm playing with dropping the cluster's connections to zookeeper and then
> > > > reconnecting them, and during recovery, I always see this on the leader's
> > > > logs:
> > > >
> > > > ElectionContext.java (line 361) Waiting until we see more replicas up for
> > > > shard shard1: total=2 found=1 timeoutin=139902
> > > >
> > > > and then on the follower, I see:
> > > > SolrException.java (line 121) There was a problem finding the leader in
> > > > zk:org.apache.solr.common.SolrException: Could not get leader props
> > > > at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:958)
> > > > at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:922)
> > > > at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1463)
> > > > at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:380)
> > > > at org.apache.solr.cloud.ZkController.access$100(ZkController.java:84)
> > > > at org.apache.solr.cloud.ZkController$1.command(ZkController.java:232)
> > > > at org.apache.solr.common.cloud.ConnectionManager$2$1.run(ConnectionManager.java:179)
> > > > Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/lc4/leaders/shard1
> > > > at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> > > > at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
> > > > at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:273)
> > > > at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:270)
> > > > at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
> > > > at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:270)
> > > > at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:936)
> > > > ... 6 more
> > > >
> > > > They block each other's progress until leader decides to give up and not
> > > > wait for more replicas to come up:
> > > >
> > > > ElectionContext.java (line 368) Was waiting for replicas to come up, but
> > > > they are taking too long - assuming they won't come back till later
> > > >
> > > > and then recovery moves forward again.
> > > >
> > > > Should waitForLeaderToSeeDownState move on if there's no leader at the
> > > > moment?
> > > >
> > > > Thanks,
> > > > Jessica
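
For reference, the behaviour proposed in this thread (a recovering replica skips the wait when no leader is registered, instead of failing) could be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions: `LeaderLookup`, `LeaderProbe`, `Outcome` and `waitForLeaderOrSkip` are hypothetical names made up for the sketch, not Solr APIs, and an empty `Optional` stands in for the `NoNodeException` on the leader znode. It is not a description of how Solr handles or resolved this.

```java
import java.util.Optional;

// Toy model of the proposal in this thread. Nothing here is a Solr class;
// all names are hypothetical and exist only for illustration.
final class LeaderWaitSketch {

    /** Possible results of the pre-election step for a recovering replica. */
    enum Outcome { LEADER_ACKNOWLEDGED, SKIPPED_NO_LEADER, TIMED_OUT }

    /** Hypothetical stand-in for reading the shard's leader znode. */
    @FunctionalInterface
    interface LeaderLookup {
        // Empty models the "NoNode for /collections/<c>/leaders/<shard>" case.
        Optional<String> currentLeader();
    }

    /** Hypothetical stand-in for asking the leader whether it sees us as down. */
    @FunctionalInterface
    interface LeaderProbe {
        boolean leaderSeesReplicaDown(String leaderUrl);
    }

    /**
     * If no leader is registered, return immediately so the caller can go on
     * to join the election; otherwise poll the leader up to maxAttempts times.
     */
    static Outcome waitForLeaderOrSkip(LeaderLookup lookup, LeaderProbe probe, int maxAttempts) {
        Optional<String> leader = lookup.currentLeader();
        if (!leader.isPresent()) {
            // Proposed change: nothing to wait for, so do not fail or block.
            return Outcome.SKIPPED_NO_LEADER;
        }
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (probe.leaderSeesReplicaDown(leader.get())) {
                return Outcome.LEADER_ACKNOWLEDGED;
            }
        }
        return Outcome.TIMED_OUT;
    }

    public static void main(String[] args) {
        // After a ZooKeeper session expiration there is no leader znode.
        System.out.println(waitForLeaderOrSkip(Optional::empty, url -> true, 3));
        // prints SKIPPED_NO_LEADER

        // Normal case: a leader exists and acknowledges the down state.
        System.out.println(waitForLeaderOrSkip(() -> Optional.of("leader-url"), url -> true, 3));
        // prints LEADER_ACKNOWLEDGED
    }
}
```

With the skip in place, the replica in the first case would proceed to the election, so the old leader's "Waiting until we see more replicas up" loop could observe it without running to its timeout.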