SolrCloud Nodes going to recovery state during indexing


We have a SolrCloud setup with the settings shared below. The collection has 3 shards and one replica for each of them.

Normal state (as soon as the whole cluster is restarted):
    - Status of all the shards is UP.
    - Each bulk update request of 50 documents takes < 100 ms.
    - 6-10 bulk update requests are sent simultaneously.

Nodes going to recovery state after 15-30 mins of updates:
    - Some shards start giving the following ERRORs:
        - o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: Read timed out
        - o.a.s.u.StreamingSolrClients error java.net.SocketTimeoutException: Read timed out
    - The following error is seen on the shard which goes to recovery state:
        - too many updates received since start - startingUpdates no longer overlaps with our currentUpdates
    - Sometimes the same shard even goes to DOWN state and needs a node restart to come back.
    - A bulk update request of 50 documents takes more than 5 secs, sometimes even > 120 secs. This is seen for all requests whenever at least one node in the whole cluster is in recovery state.

We have a standalone setup with the same collection schema, which is able to handle the update & query load without any errors.


We have the following SolrCloud setup:
    - setup in AWS.

    - Zookeeper Setup:
        - number of nodes: 3
        - aws instance type: t2.small
        - instance memory: 2gb

    - Solr Setup:
        - Solr version: 6.6.0
        - number of nodes: 3
        - aws instance type: m5.xlarge
        - instance memory: 16gb
        - number of cores: 4
        - JAVA HEAP: 8gb
        - JAVA VERSION: oracle java version "1.8.0_151"
        - GC settings: default CMS.

        collection settings:
            - number of shards: 3
            - replication factor: 2
            - total 6 replicas.
            - total number of documents in the collection: 12 million
            - total number of documents in each shard: 4 million
            - Each document has around 25 fields, 12 of which use text analyzers & filters.
            - Commit Strategy (see the solrconfig.xml sketch after this list):
                - No explicit commits from application code.
                - Hard commit every 15 secs with openSearcher set to false.
                - Soft commit every 10 mins.
            - Cache Strategy:
                - filterCache
                    - size: 512
                    - autowarmCount: 100
                - all other caches
                    - size: 512
                    - autowarmCount: 0
            - maxWarmingSearchers: 2
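
The commit & cache settings above correspond roughly to the following solrconfig.xml fragments. This is only a sketch: the sizes and times are the ones listed above, while the cache class attributes are assumptions (we have not changed them from the stock config).

    <updateHandler>
      <autoCommit>
        <maxTime>15000</maxTime>            <!-- hard commit every 15 secs -->
        <openSearcher>false</openSearcher>
      </autoCommit>
      <autoSoftCommit>
        <maxTime>600000</maxTime>           <!-- soft commit every 10 mins -->
      </autoSoftCommit>
    </updateHandler>

    <query>
      <filterCache class="solr.FastLRUCache" size="512" autowarmCount="100"/>
      <queryResultCache class="solr.LRUCache" size="512" autowarmCount="0"/>
      <documentCache class="solr.LRUCache" size="512" autowarmCount="0"/>
      <maxWarmingSearchers>2</maxWarmingSearchers>
    </query>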


We tried the following:
    - commit strategy (values shown in the sketch after this list)
        - hard commit - 150 secs
        - soft commit - 5 mins
    - with the G1 garbage collector based on https://wiki.apache.org/solr/ShawnHeisey#Java_8_recommendation_for_Solr:
        - the nodes go to recovery state in less than a minute.
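
For reference, the alternative auto-commit values we tried map to solrconfig.xml entries like these (same elements as in the sketch above, only the times differ):

    <autoCommit>
      <maxTime>150000</maxTime>             <!-- hard commit every 150 secs -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>300000</maxTime>             <!-- soft commit every 5 mins -->
    </autoSoftCommit>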

The issue is seen even when the leaders are balanced across the three nodes.

Can you help us find the solution to this problem?
