SolrCloud Nodes going to recovery state during indexing


We have a SolrCloud setup with the settings shared below. The collection has 3 shards and one replica for each of them.

Normal state (as soon as the whole cluster is restarted):
    - Status of all the shards is UP.
    - Each bulk update request of 50 documents takes < 100 ms.
    - 6-10 bulk update requests are sent simultaneously.

Nodes going to recovery state after 15-30 mins of updates:
    - Some shards start giving the following ERRORs:
        - o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: Read timed out
        - o.a.s.u.StreamingSolrClients error java.net.SocketTimeoutException: Read timed out
    - The following error is seen on the shard which goes to recovery state:
        - too many updates received since start - startingUpdates no longer overlaps with our currentUpdates
    - Sometimes the same shard even goes to DOWN state and needs a node restart to come back.
    - A bulk update request of 50 documents takes more than 5 secs, sometimes even > 120 secs. This is seen for all requests whenever at least one node in the whole cluster is in recovery state.

We have a standalone setup with the same collection schema, which is able to handle the update & query load without any errors.


We have the following SolrCloud setup:
    - setup in AWS.

    - Zookeeper Setup:
        - number of nodes: 3
        - aws instance type: t2.small
        - instance memory: 2gb

    - Solr Setup:
        - Solr version: 6.6.0
        - number of nodes: 3
        - aws instance type: m5.xlarge
        - instance memory: 16gb
        - number of cores: 4
        - JAVA HEAP: 8gb
        - JAVA VERSION: oracle java version "1.8.0_151"
        - GC settings: default CMS.

        collection settings:
            - number of shards: 3
            - replication factor: 2
            - total 6 replicas.
            - total number of documents in the collection: 12 million
            - total number of documents in each shard: 4 million
            - Each document has around 25 fields, 12 of which use text analyzers & filters.
            - Commit Strategy (see the solrconfig.xml sketch after this list):
                - No explicit commits from application code.
                - Hard commit every 15 secs with openSearcher set to false.
                - Soft commit every 10 mins.
            - Cache Strategy:
                - filterCache
                    - size: 512
                    - autowarmCount: 100
                - all other caches
                    - size: 512
                    - autowarmCount: 0
            - maxWarmingSearchers: 2
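
The commit & cache settings above correspond roughly to the following solrconfig.xml fragments. This is only a sketch: the sizes and times are the ones listed above, while the cache class attributes are assumptions (we have not changed them from the stock config).

    <updateHandler>
      <autoCommit>
        <maxTime>15000</maxTime>            <!-- hard commit every 15 secs -->
        <openSearcher>false</openSearcher>
      </autoCommit>
      <autoSoftCommit>
        <maxTime>600000</maxTime>           <!-- soft commit every 10 mins -->
      </autoSoftCommit>
    </updateHandler>

    <query>
      <filterCache class="solr.FastLRUCache" size="512" autowarmCount="100"/>
      <queryResultCache class="solr.LRUCache" size="512" autowarmCount="0"/>
      <documentCache class="solr.LRUCache" size="512" autowarmCount="0"/>
      <maxWarmingSearchers>2</maxWarmingSearchers>
    </query>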


We tried the following:
    - commit strategy (values shown in the sketch after this list)
        - hard commit - 150 secs
        - soft commit - 5 mins
    - with the G1 garbage collector based on https://wiki.apache.org/solr/ShawnHeisey#Java_8_recommendation_for_Solr:
        - the nodes go to recovery state in less than a minute.
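
For reference, the alternative auto-commit values we tried map to solrconfig.xml entries like these (same elements as in the sketch above, only the times differ):

    <autoCommit>
      <maxTime>150000</maxTime>             <!-- hard commit every 150 secs -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>300000</maxTime>             <!-- soft commit every 5 mins -->
    </autoSoftCommit>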

The issue is seen even when the leaders are balanced across the three nodes.

Can you help us find the solution to this problem?
