Emir,

Yes, there is a delete_by_query on every bulk insert. This delete_by_query deletes all documents whose last update is older than one day before the current time. Could the delete_by_query during bulk indexing be the reason?
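For reference, each indexing cycle looks roughly like the following from the application side (a simplified SolrJ sketch; the ZooKeeper hosts, collection name, and the updated_at field are placeholders, not our exact schema):

    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;
    import java.util.UUID;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexCycleSketch {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181,zk2:2181,zk3:2181")       // placeholder ZK ensemble
                    .build()) {
                client.setDefaultCollection("my_collection");        // placeholder collection name

                // delete_by_query sent with every bulk insert: removes documents
                // whose last update is older than one day before the current time.
                client.deleteByQuery("updated_at:[* TO NOW-1DAY]");  // placeholder date field

                // Bulk insert of ~50 documents in the same cycle.
                List<SolrInputDocument> batch = new ArrayList<>();
                for (int i = 0; i < 50; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", UUID.randomUUID().toString());
                    doc.addField("updated_at", new Date());
                    batch.add(doc);
                }
                client.add(batch);
                // No explicit commit: we rely on autoCommit (15s, openSearcher=false)
                // and autoSoftCommit (10 min) from solrconfig.xml.
            }
        }
    }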
On Wed, Jan 3, 2018 at 7:58 PM, Emir Arnautović <emir.arnauto...@sematext.com> wrote:
> Do you have deletes by query while indexing, or is it an append-only index?
>
> Regards,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
> On 3 Jan 2018, at 12:16, sravan <sra...@caavo.com> wrote:
> >
> > SolrCloud Nodes going to recovery state during indexing
> >
> > We have a SolrCloud setup with the settings shared below. We have a collection with 3 shards and a replica for each of them.
> >
> > Normal state (as soon as the whole cluster is restarted):
> > - Status of all the shards is UP.
> > - A bulk update request of 50 documents takes < 100ms.
> > - 6-10 simultaneous bulk updates.
> >
> > Nodes go to recovery state after 15-30 mins of updates:
> > - Some shards start giving the following ERRORs:
> >   - o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: Read timed out
> >   - o.a.s.u.StreamingSolrClients error java.net.SocketTimeoutException: Read timed out
> > - The following error is seen on the shard which goes to recovery state:
> >   - too many updates received since start - startingUpdates no longer overlaps with our currentUpdates
> > - Sometimes the same shard even goes to DOWN state and needs a node restart to come back.
> > - A bulk update request of 50 documents takes more than 5 seconds, sometimes even > 120 secs. This is seen for all requests if at least one node in the whole cluster is in recovery state.
> >
> > We have a standalone setup with the same collection schema which is able to take the update & query load without any errors.
> >
> > We have the following SolrCloud setup:
> > - Setup in AWS.
> >
> > - Zookeeper setup:
> >   - number of nodes: 3
> >   - AWS instance type: t2.small
> >   - instance memory: 2gb
> >
> > - Solr setup:
> >   - Solr version: 6.6.0
> >   - number of nodes: 3
> >   - AWS instance type: m5.xlarge
> >   - instance memory: 16gb
> >   - number of cores: 4
> >   - Java heap: 8gb
> >   - Java version: Oracle Java "1.8.0_151"
> >   - GC settings: default CMS
> >
> > Collection settings:
> > - number of shards: 3
> > - replication factor: 2
> > - total 6 replicas
> > - total number of documents in the collection: 12 million
> > - total number of documents in each shard: 4 million
> > - each document has around 25 fields, with 12 of them using textual analysers & filters
> > - Commit strategy:
> >   - no explicit commits from application code
> >   - hard commit of 15 secs with openSearcher=false
> >   - soft commit of 10 mins
> > - Cache strategy:
> >   - filter queries: size 512, autowarmCount 100
> >   - all other caches: size 512, autowarmCount 0
> > - maxWarmingSearchers: 2
> >
> > We tried the following:
> > - Commit strategy:
> >   - hard commit: 150 secs
> >   - soft commit: 5 mins
> > - G1 garbage collector based on https://wiki.apache.org/solr/ShawnHeisey#Java_8_recommendation_for_Solr:
> >   - the nodes go to recovery state in less than a minute.
> >
> > The issue is seen even when the leaders are balanced across the three nodes.
> >
> > Can you help us find the solution to this problem?
> >
> > --
> > Regards,
> > Sravan