This is a protective measure. When it looks like a shard is first coming up, we wait until we see all the expected replicas, or until a timeout expires, to ensure that everyone participates in the initial sync process - if all the nodes went down, we don't know which documents made it where, and we don't want to lose any updates.
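If that wait is longer than you want in a test setup, you can tune it. As far as I know, the timeout is controlled by the leaderVoteWait setting (in milliseconds) on the <cores> element in solr.xml; the value below is just an illustration, not a recommendation:

    <!-- solr.xml (Solr 4.x): how long a would-be leader waits for the
         other known replicas of its shard before proceeding with the
         election. Value is in milliseconds; 180000 (3 minutes) is the
         default, as far as I know. -->
    <cores adminPath="/admin/cores" leaderVoteWait="30000">
      ...
    </cores>

Lowering it gets the shard responding again sooner after a full outage, at the cost of a greater chance of electing a leader that is missing updates.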
- Mark

On Nov 28, 2012, at 10:47 AM, Daniel Collins <danwcoll...@gmail.com> wrote:

> I was testing the basic SolrCloud test scenario from the wiki page, and
> found something (I considered) unexpected.
>
> If the leader of the shard goes down, when it comes back up it requires N
> replicas to be running (where N is determined from what was running
> before, I think).
>
> Simple setup: 4 servers, 2 shards (A, B), each with 2 replicas, e.g. A1,
> A2, B1, B2.
>
> All 4 nodes start up, A1 and B1 are leaders, all is well.
>
> A2 brought down, cloud is still fine. A2 brought back up and recovers;
> once recovery completes, it is live.
>
> A2 goes down, then A1. Cloud is now unresponsive as shard A has no nodes
> (as expected).
>
> A1 comes back up. However, the shard is still not responsive due to
> errors:
>
> 2012-11-28 10:45:27,328 INFO [main] o.a.s.c.ShardLeaderElectionContext
> [ElectionContext.java:287] Waiting until we see more replicas up: total=2
> found=1 timeoutin=140262
>
> I can understand that in the cloud setup A1 (if it wasn't the leader)
> would have to recover, but as A1 was the leader when it went down,
> shouldn't it be able to service requests on its own (it was when it went
> down!)?