We are running into a timing issue when doing a scripted deployment of our SolrCloud cluster.
Scenario to reproduce (sometimes):

1. Launch 3 clean Solr nodes connected to ZooKeeper.
2. Create a 1-shard collection with a replica on each node.
3. Load data (more data makes the problem worse).
4. Launch 3 more nodes.
5. Add a replica on each new node.
6. Once the entire cluster is healthy, start killing the first three nodes.

Depending on the timing, the second three nodes all end up in RECOVERING state without a leader. This appears to happen because when the first leader dies, all the new nodes go into full replication recovery, and if all the old boxes happen to die during that window, the new boxes are stuck. They cannot serve requests, and they eventually (1-8 hours) go into RECOVERY_FAILED state. This state is easy to fix with a FORCELEADER call to the Collections API, but that's only remediation, not prevention.

My question is this: why do the new nodes have to go into full replication recovery when they are already up to date? I just added the replicas, so they shouldn't have to do a full replication again.

Jim
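P.S. In case it's useful, here is roughly what the remediation script could look like against the Collections API, using the standard CLUSTERSTATUS and FORCELEADER actions. The base URL and collection name below are placeholders for our real values, and the CLUSTERSTATUS parsing only pulls the fields needed here:

import json
import urllib.request

SOLR_URL = "http://localhost:8983/solr"   # placeholder: any live node
COLLECTION = "mycollection"               # placeholder: our collection name

def collections_api(action, **params):
    """Call the Collections API and return the parsed JSON response."""
    query = "&".join(f"{k}={v}" for k, v in params.items())
    url = f"{SOLR_URL}/admin/collections?action={action}&wt=json&{query}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def leaderless_shards(collection):
    """Yield the names of shards that have no active leader replica."""
    status = collections_api("CLUSTERSTATUS", collection=collection)
    shards = status["cluster"]["collections"][collection]["shards"]
    for shard_name, shard in shards.items():
        has_leader = any(
            replica.get("leader") == "true" and replica["state"] == "active"
            for replica in shard["replicas"].values()
        )
        if not has_leader:
            yield shard_name

# Force a leader election on any shard that is stuck without one.
for shard in leaderless_shards(COLLECTION):
    print(f"forcing leader election on shard {shard}")
    collections_api("FORCELEADER", collection=COLLECTION, shard=shard)

Again, this only heals the cluster after the fact; it doesn't prevent the new replicas from dropping into full replication recovery in the first place.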