We are running into a timing issue when trying to do a scripted
deployment of our SolrCloud cluster.

Scenario to reproduce (sometimes):

1. Launch 3 clean Solr nodes connected to ZooKeeper.
2. Create a 1-shard collection with a replica on each node.
3. Load data (more data makes the problem worse).
4. Launch 3 more nodes.
5. Add a replica to each new node (steps 2 and 5 are sketched as a
   script below).
6. Once the entire cluster is healthy, start killing the first three
   nodes.
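
For concreteness, steps 2 and 5 in our script look roughly like this
(host and collection names are made up; plain HTTP against the stock
Collections API, shown here in Python):

    import requests

    SOLR = "http://oldnode1:8983/solr"  # any live node; names are made up

    # Step 2: one shard, one replica on each of the three original nodes.
    requests.get(SOLR + "/admin/collections", params={
        "action": "CREATE", "name": "mycoll",
        "numShards": 1, "replicationFactor": 3, "wt": "json",
    }).raise_for_status()

    # Step 5: add a replica of shard1 on each new node.
    for node in ["newnode1:8983_solr", "newnode2:8983_solr",
                 "newnode3:8983_solr"]:
        requests.get(SOLR + "/admin/collections", params={
            "action": "ADDREPLICA", "collection": "mycoll",
            "shard": "shard1", "node": node, "wt": "json",
        }).raise_for_status()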

Depending on the timing, the three new nodes all end up in RECOVERING
state with no leader.

This appears to happen because when the old leader dies, all the new
nodes go into full replication recovery, and if the remaining old nodes
happen to die while that is in progress, the new nodes are stuck. They
cannot serve requests, and they eventually (1-8 hours later) go into
RECOVERY_FAILED state.
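
If it helps anyone reproduce this, the stuck state is visible via
CLUSTERSTATUS: every replica of the shard reports state "recovering"
and none carries the leader flag. A rough sketch of such a check,
reusing the made-up names from above:

    import requests

    def shard1_status(solr="http://newnode1:8983/solr"):
        # CLUSTERSTATUS reports each replica's state plus a "leader" flag.
        rsp = requests.get(solr + "/admin/collections", params={
            "action": "CLUSTERSTATUS", "collection": "mycoll", "wt": "json",
        }).json()
        shard = rsp["cluster"]["collections"]["mycoll"]["shards"]["shard1"]
        replicas = shard["replicas"].values()
        has_leader = any(r.get("leader") == "true" and r["state"] == "active"
                         for r in replicas)
        states = sorted({r["state"] for r in replicas})
        return has_leader, states

    # Stuck when this prints (False, ['recovering']).
    print(shard1_status())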

This state is easy to fix with a FORCELEADER call to the Collections
API, but that's only remediation, not prevention.
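
For reference, the remediation is just (same hypothetical names):

    import requests

    # Ask the leaderless shard to elect a leader from its replicas.
    requests.get("http://newnode1:8983/solr/admin/collections", params={
        "action": "FORCELEADER", "collection": "mycoll", "shard": "shard1",
        "wt": "json",
    }).raise_for_status()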

My question is this: Why do the new nodes have to go into full
replication recovery when they are already up to date? I just added the
replicas, so they shouldn't have to do a full replication again.

Jim