On 10/2/2018 8:55 PM, Ganesh Sethuraman wrote:
> We are using a 2-node SolrCloud 7.2.1 cluster with an external 3-node ZK
> ensemble in AWS. There are about 60 collections at any point in time. We
> have a per-JVM max heap of 8GB.

Let's focus for right now on a single Solr machine, rather than the whole cluster:

- How many shard replicas (cores) are on one server?
- How much disk space does all the index data take?
- How many documents (maxDoc, which includes deleted docs) are in all those cores?
- What is the total amount of RAM on the server?
- Is there any other software besides Solr running on each server?

https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue
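
If it helps with gathering those numbers, the CoreAdmin STATUS call will report numDocs, maxDoc, and index size for every core on a node. Something like this, where localhost stands in for one of your servers:

  curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"

Look at the "index" section of each core in the response for maxDoc and sizeInBytes.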

> But as stated in the problem above, we will have a few collection
> replicas in the recovering and down states. In the past we have seen them
> come back to normal by restarting the Solr server, but we want to
> understand whether there is any way to get this back to normal (all
> synced up with ZooKeeper) through the command line/admin. Another
> question: can being in this state cause data issues? How do we check that
> (distrib=false on collection count?)?

As long as you have at least one replica operational on every shard, you should be OK.  But if you only have one replica operational, then you're in a precarious state, where one additional problem could result in something being unavailable.
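
On checking the data: the distrib=false idea from your question is the right instrument. Query each replica of the same shard directly (the core name below is just a placeholder) and compare numFound across them:

  curl "http://localhost:8983/solr/mycollection_shard1_replica_n1/select?q=*:*&rows=0&distrib=false"

Keep in mind that counts can legitimately differ for a short time while indexing is in progress, so compare them when the index is quiet.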

If all is well, SolrCloud should not have replicas stay in down or recovering state for very long, unless they're really large, in which case it can take a while to copy the data from the leader.  If that state persists for a long time, there's probably something going wrong with your Solr install.  Usually restarting Solr is the only way to recover persistently down replicas.  If it happens again after restart, then the root problem has not been dealt with, and you're going to need to figure it out.
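
You can watch replica states without restarting anything by asking the Collections API for CLUSTERSTATUS (the collection name is a placeholder):

  curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection&wt=json"

That shows you what ZooKeeper believes the state of every replica is, which is useful for confirming whether a replica is actually stuck or just slow to recover.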

The log snippet you shared only covers a timespan of less than one second, so it's not very helpful in making any kind of determination.  The "session expired" message sounds like what happens when the zkClientTimeout value is exceeded.  Internally, this value defaults to 15 seconds, and typical example configs set it to 30 seconds ... so when the session expires, it means there's a SERIOUS problem.  For computer software, 15 or 30 seconds is a relative eternity.  A properly running system should NEVER exceed that timeout.
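
For reference, the timeout normally lives in the <solrcloud> section of solr.xml. The stock file that ships with recent versions looks roughly like this (30000 ms is the usual out-of-the-box value, shown here for illustration, not as a recommendation to change it):

  <solrcloud>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    <!-- other solrcloud settings omitted -->
  </solrcloud>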

Can you share your solr log when the problem happens, covering a timespan of at least a few minutes (and ideally much longer), as well as a gc log from a time when Solr was up for a long time?  Hopefully the solr.log and gc log will cover the same timeframe.  You'll need to use a file sharing site for the GC log, since it's likely to be a large file.  I would suggest compressing it.  If the solr.log is small enough, you could use a paste website for that, but if it's large, you'll need to use a file sharing site.  Attachments to list email are almost never preserved.
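
On a default install the GC log is written under server/logs. To compress a copy for upload without disturbing the live file, something like this works (substitute the actual log name on your system):

  gzip -c server/logs/<your_gc_log> > solr_gc_log.gz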

Thanks,
Shawn
