SolrCloud Nodes going to recovery state during indexing
We have a SolrCloud setup with the settings shared below. The
collection has 3 shards with 2 replicas each (replication factor 2).
Normal state (as soon as the whole cluster is restarted):
- Status of all the shards is UP.
- A bulk update request of 50 documents takes < 100 ms.
- 6-10 bulk update requests run simultaneously.
Degraded state (after 15-30 minutes of updates):
- Nodes go into recovery state.
- Some shards start logging the following ERRORs:
  - o.a.s.h.RequestHandlerBase
    org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
    Async exception during distributed update: Read timed out
  - o.a.s.u.StreamingSolrClients error
    java.net.SocketTimeoutException: Read timed out
- The following error is seen on the shard that goes into recovery
  state:
  - too many updates received since start - startingUpdates no
    longer overlaps with our currentUpdates.
- Sometimes the same shard even goes to the DOWN state and needs a
  node restart to come back.
- A bulk update request of 50 documents takes more than 5 seconds,
  sometimes even more than 120 seconds. This happens for all requests
  whenever at least one node in the cluster is in recovery state.
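For scale, the latencies above can be turned into rough indexing rates. This is only a back-of-the-envelope sketch in Python. It assumes that each client sends its next batch as soon as the previous one returns, which the numbers above do not confirm, and the function name docs_per_second is made up for illustration.

```python
DOCS_PER_BATCH = 50  # batch size stated in the post


def docs_per_second(batch_latency_ms, concurrent_requests):
    """Documents indexed per second if every batch takes batch_latency_ms
    and each client immediately sends the next batch (an assumption)."""
    return DOCS_PER_BATCH * concurrent_requests * 1000 / batch_latency_ms


# Normal state: each batch takes < 100 ms, so these are lower bounds.
normal_6_clients = docs_per_second(100, 6)
normal_10_clients = docs_per_second(100, 10)

# Degraded state: each batch takes > 5 s, so this is an upper bound.
degraded_10_clients = docs_per_second(5000, 10)

# Minimum slowdown factor between the two states at 10 clients.
slowdown = normal_10_clients / degraded_10_clients

print(normal_6_clients, normal_10_clients, degraded_10_clients, slowdown)
```

Under that assumption the cluster drops from at least a few thousand documents per second to at most about a hundred once a node is in recovery.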
We also have a standalone setup with the same collection schema, which
handles update and query load without any errors.
Our SolrCloud setup is as follows:
- Hosted on AWS.
- ZooKeeper setup:
  - number of nodes: 3
  - AWS instance type: t2.small
  - instance memory: 2 GB
- Solr setup:
  - Solr version: 6.6.0
  - number of nodes: 3
  - AWS instance type: m5.xlarge
  - instance memory: 16 GB
  - number of CPU cores: 4
  - Java heap: 8 GB
  - Java version: Oracle Java, version "1.8.0_151"
  - GC settings: default CMS
Collection settings:
- number of shards: 3
- replication factor: 2
- total number of replicas: 6
- total number of documents in the collection: 12 million
- number of documents in each shard: 4 million
- Each document has around 25 fields, 12 of which are text fields
  with analyzers and filters.
- Commit strategy:
  - No explicit commits from application code.
  - Hard commit every 15 seconds with openSearcher=false.
  - Soft commit every 10 minutes.
- Cache strategy:
  - filter queries
    - size: 512
    - autowarmCount: 100
  - all other caches
    - size: 512
    - autowarmCount: 0
- maxWarmingSearchers: 2
We tried the following:
- A different commit strategy:
  - hard commit: 150 seconds
  - soft commit: 5 minutes
- The G1 garbage collector (G1GC), based on
  https://wiki.apache.org/solr/ShawnHeisey#Java_8_recommendation_for_Solr
  - With this change, the nodes go into recovery state in less than
    a minute.
The issue is seen even when the leaders are balanced across the three
nodes.
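Leader balance and replica state can be read from the Collections API (action=CLUSTERSTATUS with wt=json). The Python sketch below is one possible way to summarize such a response, not a definitive tool. The collection name, node names and replica names in the sample are hypothetical, and the sample is hand-written in the shape that CLUSTERSTATUS returns rather than taken from our cluster.

```python
from collections import Counter


def summarize_cluster(status, collection):
    """Return (leaders per node, replicas that are not active) for one
    collection, given a parsed CLUSTERSTATUS JSON response."""
    shards = status["cluster"]["collections"][collection]["shards"]
    leaders = Counter()
    not_active = []
    for shard_name, shard in shards.items():
        for replica_name, replica in shard["replicas"].items():
            # The leader flag is accepted as the string "true" or a boolean.
            if str(replica.get("leader", "")).lower() == "true":
                leaders[replica["node_name"]] += 1
            state = replica.get("state", "unknown")
            if state != "active":
                not_active.append((shard_name, replica_name, state))
    return dict(leaders), sorted(not_active)


def replica(node, state, leader=False):
    """Build one hypothetical replica entry for the sample below."""
    entry = {"node_name": node, "state": state}
    if leader:
        entry["leader"] = "true"
    return entry


# Hypothetical sample: 3 shards x 2 replicas on 3 nodes, one replica recovering.
sample = {"cluster": {"collections": {"mycollection": {"shards": {
    "shard1": {"replicas": {
        "core_node1": replica("node1:8983_solr", "active", leader=True),
        "core_node2": replica("node2:8983_solr", "active")}},
    "shard2": {"replicas": {
        "core_node3": replica("node2:8983_solr", "active", leader=True),
        "core_node4": replica("node3:8983_solr", "recovering")}},
    "shard3": {"replicas": {
        "core_node5": replica("node3:8983_solr", "active", leader=True),
        "core_node6": replica("node1:8983_solr", "active")}},
}}}}}

leaders, not_active = summarize_cluster(sample, "mycollection")
print("leaders per node:", leaders)
print("replicas not active:", not_active)
```

One leader per node and an empty second list would correspond to the normal state described above. Any entry in the second list marks a replica that is recovering or down.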
Can you help us find a solution to this problem?