This is a protective measure. When it looks like a shard is first coming up, we 
wait until we see all the expected replicas, or until a timeout expires, to ensure 
that everyone participates in the initial sync process - if all the nodes went 
down, we don't know which documents made it where, and we don't want to lose any 
updates.
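If the wait is too long for your setup, I believe the timeout is tunable via the leaderVoteWait attribute on the <cores> element in solr.xml (value in milliseconds; the default is on the order of a few minutes, which matches the timeoutin value in your log). A rough sketch - the other attributes and paths here are just illustrative, adapt them to your own solr.xml:

```xml
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1"
         host="${host:}" hostPort="${jetty.port:8983}"
         leaderVoteWait="180000">
    <!-- leaderVoteWait: how long a would-be leader waits to see the
         other replicas before giving up and becoming leader anyway -->
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>
```

Lowering it shortens the window in which a restarted leader refuses requests, at the cost of a greater chance of losing updates that only made it to replicas that haven't rejoined yet.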

- Mark

On Nov 28, 2012, at 10:47 AM, Daniel Collins <danwcoll...@gmail.com> wrote:

> I was testing the basic SolrCloud test scenario from the wiki page, and
> found something (I considered) unexpected.
> 
> If the leader of the shard goes down, when it comes back up it requires N
> replicas to be running (where N is, I think, determined by what was running
> before).
> 
> Simple setup, 4 servers, 2 shards (A, B), each with 2 replicas, e.g. A1,
> A2, B1, B2.
> 
> All 4 nodes start-up, A1, B1 are leaders, all is well.
> 
> A2 brought down, cloud is still fine. A2 brought back up and recovers; once
> recovery is complete, it is live.
> 
> A2 goes down, then A1.  Cloud is now unresponsive as Shard A has no nodes
> (as expected).
> 
> A1 comes back up.  However, shard is still not responsive due to errors
> 
> 2012-11-28 10:45:27,328 INFO [main] o.a.s.c.ShardLeaderElectionContext
> [ElectionContext.java:287] Waiting until we see more replicas up: total=2
> found=1 timeoutin=140262
> 
> I can understand that in the cloud setup A1 (if it wasn't the leader) would
> have to recover, but as A1 was the leader when it went down, shouldn't it be
> able to service requests on its own (it was when it went down!)?
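For what it's worth, the wait in that log line ("Waiting until we see more replicas up: total=2 found=1") boils down to a poll-until-found-or-timeout loop. A minimal sketch of that behavior - function and parameter names are mine, not Solr's, and the real implementation watches ZooKeeper rather than polling a callback:

```python
import time

def wait_for_replicas(seen_replicas, expected_total, timeout_ms, poll_ms=50):
    """Wait until all expected replicas are visible, or the timeout expires.

    seen_replicas: callable returning the collection of replicas currently
    visible (a stand-in for watching live nodes in ZooKeeper).
    Returns True if everyone showed up, False if we gave up on the timeout.
    """
    deadline = time.monotonic() + timeout_ms / 1000.0
    while time.monotonic() < deadline:
        found = len(seen_replicas())
        if found >= expected_total:
            return True
        # Not everyone is up yet - sleep briefly and check again.
        time.sleep(poll_ms / 1000.0)
    return False
```

Only when this returns False (the timeout case) does the node proceed to become leader without having seen its peers.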
