If I'm reading this correctly, you have a very large amount of index in not much memory. You only have 14g of heap allocated across 130 replicas, at least one of which has a 20g index. You don't need as much memory as your aggregate index size, but this system feels severely under-provisioned. I suspect that's the root of your instability.
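
As a rough back-of-the-envelope check, here is a minimal sketch using only the numbers from your message. It assumes all 130 cores live on each node and that whatever RAM is not heap is available to the OS for caching index files, which overstates it somewhat. The function name and the use of decimal megabytes are just for illustration; this is not a sizing rule.

```python
# Rough sizing arithmetic; all inputs come from the original message.
# Assumption: everything outside the JVM heap is usable as OS page cache.

def index_vs_cache_mb(ram_mb, heap_mb, cores, min_index_mb, max_index_mb):
    """Return (off-heap RAM, smallest possible total index, largest possible total index)."""
    off_heap_mb = ram_mb - heap_mb          # upper bound on memory left for caching index files
    lowest_total_mb = cores * min_index_mb  # if every core were the smallest size quoted
    highest_total_mb = cores * max_index_mb # if every core were the largest size quoted
    return off_heap_mb, lowest_total_mb, highest_total_mb


off_heap, lowest, highest = index_vs_cache_mb(
    ram_mb=90_000, heap_mb=14_000, cores=130,
    min_index_mb=700, max_index_mb=20_000,
)
print(f"off-heap RAM:      {off_heap} MB")
print(f"total index range: {lowest} to {highest} MB")
print(f"smallest possible index exceeds off-heap RAM: {lowest > off_heap}")
```

Even if every core were at the 700MB end of your range, the index on a node would be about 91GB against at most 76GB outside the heap, and the real total is presumably well above that.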

Best,
Erick

On Thu, Sep 5, 2019, 07:08 Doss <itsmed...@gmail.com> wrote:

> Hi,
>
> We are using 3 node SOLR (7.0.1) cloud setup 1 node zookeeper ensemble.
> Each system has 16CPUs, 90GB RAM (14GB HEAP), 130 cores (3 replicas NRT)
> with index size ranging from 700MB to 20GB.
>
> autoCommit - 10 minutes once
> softCommit - 30 Sec Once
>
> At peak time if a shard goes to recovery mode many other shards also going
> to recovery mode in few minutes, which creates huge load (200+ load
> average) and SOLR becomes non responsive. To fix this we are restarting the
> node, again leader tries to correct the index by initiating replication,
> which causes load again, and the node goes to non responsive state.
>
> As soon as a node starts the replication process initiated for all 130
> cores, is there any we control it, like one after the other?
>
> Thanks,
> Doss.