I'm working with a cluster of SolrCloud servers configured with 10 shards and 4 replicas per shard in a stress environment. The planned production configuration is 10 shards and 15 replicas per shard.

Current commit settings are as follows:

  <autoSoftCommit>
    <maxDocs>500000</maxDocs>
    <maxTime>180000</maxTime>
  </autoSoftCommit>
  <autoCommit>
    <maxDocs>2000000</maxDocs>
    <maxTime>180000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

The application needs to index approximately 90 million docs, which are indexed in two ways:

a) Full indexing. It takes 4 hours to index 90 million docs, and the rate of docs coming to the searcher is around 6000 per second.
b) Incremental indexing. It takes an hour to index the delta changes. There are roughly 3 million changes, and the rate of docs coming to the searchers is 2500 per second.

I use two collections, for example collection1 and collection2. Each collection has system settings of 12 GB of available RAM and a quad-core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz.

Full indexing is always performed on the collection that is not serving live traffic. Once the job completes, we swap the collections so that the collection with the latest data serves traffic and the other becomes inactive. Incremental indexing is always performed on the collection that is serving live traffic.

The problem is that about 10 minutes after indexing is triggered, the replicas go into recovery mode. This happens on all the shards. After about 20 minutes or more, the rest of the replicas start to fall into recovery mode. After about half an hour, all replicas except the leader are in recovery mode.

I cannot throttle the indexing load, as that would increase our overall indexing time. So to work around this issue, I remove all the replicas before indexing starts and add them back after indexing completes.

The behavior (replicas falling into recovery mode) during incremental indexing is troublesome, as I cannot remove replicas during incremental indexing since the collection serves live traffic. I tried to throttle the speed at which documents are indexed, but with no success: the cluster still goes into recovery.
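
As a rough sanity check on the commit settings quoted above, the sketch below estimates which auto-commit threshold (maxDocs or maxTime) would be reached first on a single shard. It assumes the stated indexing rates are cluster-wide totals, that documents are spread evenly across the 10 shards, and that each core counts documents toward maxDocs on its own. None of this is confirmed in the post, and the function name first_trigger is purely illustrative.

```python
# Back-of-the-envelope check: which auto-commit threshold is hit first on one shard?
# Assumptions (not stated in the post): the quoted rates are cluster-wide totals,
# documents are spread evenly across shards, and maxDocs is counted per core.

NUM_SHARDS = 10  # from the post: 10 shards


def first_trigger(cluster_docs_per_sec, max_docs, max_time_ms, num_shards=NUM_SHARDS):
    """Return (threshold_name, seconds_until_it_is_reached) for a single shard."""
    per_shard_rate = cluster_docs_per_sec / num_shards
    secs_to_max_docs = max_docs / per_shard_rate
    secs_to_max_time = max_time_ms / 1000.0
    if secs_to_max_docs < secs_to_max_time:
        return "maxDocs", secs_to_max_docs
    return "maxTime", secs_to_max_time


if __name__ == "__main__":
    # Rates and thresholds are the figures quoted in the post.
    for mode, rate in (("full", 6000), ("incremental", 2500)):
        for name, max_docs in (("autoSoftCommit", 500_000), ("autoCommit", 2_000_000)):
            trigger, secs = first_trigger(rate, max_docs, 180_000)
            print(f"{mode:12s} {name:15s} -> {trigger} after {secs:.0f}s")
```

Under those assumptions maxTime (180 seconds) is reached first in both indexing modes, so the maxDocs limits would never come into play. If the quoted rates were per shard instead, the soft-commit maxDocs limit would be reached first during full indexing, after roughly 83 seconds.
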

If I leave the cluster as is, the indexing eventually completes and the replicas recover after a while, but since this collection is serving live traffic, I cannot let the replicas go into recovery mode, because it also degrades search performance (based on the tests performed).

I tried different commit settings, such as the following:

a) No auto soft commit, no auto hard commit, and a commit triggered at the end of indexing
b) No auto soft commit, auto hard commit enabled, and a commit at the end of indexing
c) Auto soft commit enabled, no auto hard commit
d) Auto soft commit enabled, auto hard commit enabled
e) Different frequency settings for the commits above

Unfortunately, all of the above yield the same behavior. The replicas still go into recovery.

I have increased the ZooKeeper timeout from 30 seconds to 5 minutes, and the problem persists.

Is there any setting that would fix this issue?

-----
-goutham