On 10/2/2018 8:55 PM, Ganesh Sethuraman wrote:
> We are using a 2-node SolrCloud 7.2.1 cluster with an external 3-node ZK
> ensemble in AWS. There are about 60 collections at any point in time. We
> have a per-JVM max heap of 8GB.

Let's focus for right now on a single Solr machine, rather than the whole cluster:

- How many shard replicas (cores) are on one server?
- How much disk space does all the index data take?
- How many documents (maxDoc, which includes deleted docs) are in all those cores?
- What is the total amount of RAM on the server?
- Is there any other software besides Solr running on each server?

https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue
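
If it helps with gathering those numbers, the CoreAdmin STATUS call will report numDocs, maxDoc, and index size for every core on a node. Something like this, where localhost stands in for one of your servers:

  curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"

Look at the "index" section of each core in the response for maxDoc and sizeInBytes.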

> But as stated in the problem above, we will have a few collection
> replicas in the recovering and down states. In the past we have seen them
> come back to normal by restarting the Solr server, but we want to
> understand whether there is any way to get this back to normal (all
> synced up with ZooKeeper) through the command line/admin. Another
> question: can being in this state cause data issues? How do we check that
> (distrib=false on collection count?)?

As long as you have at least one replica operational on every shard, you should be OK.  But if you only have one replica operational, then you're in a precarious state, where one additional problem could result in something being unavailable.
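
On checking the data: the distrib=false idea from your question is the right instrument. Query each replica of the same shard directly (the core name below is just a placeholder) and compare numFound across them:

  curl "http://localhost:8983/solr/mycollection_shard1_replica_n1/select?q=*:*&rows=0&distrib=false"

Keep in mind that counts can legitimately differ for a short time while indexing is in progress, so compare them when the index is quiet.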

If all is well, SolrCloud should not have replicas stay in down or recovering state for very long, unless they're really large, in which case it can take a while to copy the data from the leader.  If that state persists for a long time, there's probably something going wrong with your Solr install.  Usually restarting Solr is the only way to recover persistently down replicas.  If it happens again after restart, then the root problem has not been dealt with, and you're going to need to figure it out.
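
You can watch replica states without restarting anything by asking the Collections API for CLUSTERSTATUS (the collection name is a placeholder):

  curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection&wt=json"

That shows you what ZooKeeper believes the state of every replica is, which is useful for confirming whether a replica is actually stuck or just slow to recover.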

The log snippet you shared only covers a timespan of less than one second, so it's not very helpful in making any kind of determination.  The "session expired" message sounds like what happens when the zkClientTimeout value is exceeded.  Internally, this value defaults to 15 seconds, and typical example configs set it to 30 seconds ... so when the session expires, it means there's a SERIOUS problem.  For computer software, 15 or 30 seconds is a relative eternity.  A properly running system should NEVER exceed that timeout.
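
For reference, the timeout normally lives in the <solrcloud> section of solr.xml. The stock file that ships with recent versions looks roughly like this (30000 ms is the usual out-of-the-box value, shown here for illustration, not as a recommendation to change it):

  <solrcloud>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    <!-- other solrcloud settings omitted -->
  </solrcloud>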

Can you share your solr log when the problem happens, covering a timespan of at least a few minutes (and ideally much longer), as well as a gc log from a time when Solr was up for a long time?  Hopefully the solr.log and gc log will cover the same timeframe.  You'll need to use a file sharing site for the GC log, since it's likely to be a large file.  I would suggest compressing it.  If the solr.log is small enough, you could use a paste website for that, but if it's large, you'll need to use a file sharing site.  Attachments to list email are almost never preserved.
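
On a default install the GC log is written under server/logs. To compress a copy for upload without disturbing the live file, something like this works (substitute the actual log name on your system):

  gzip -c server/logs/<your_gc_log> > solr_gc_log.gz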

Thanks,
Shawn
