On 10/5/2018 5:15 AM, Ganesh Sethuraman wrote:
1. Do the GC and Solr logs help explain why the Solr replicas continue to
be in the recovering state? Our assumption is that the ZK transaction log
reading we did on Sept 17 at 16:00 hrs might have caused the issue. Is
that correct?
2. Can this state slow down Solr queries for reads?
3. Is there any way to get notified (e.g. by email) if any replica on the
servers goes into recovery mode?

Seeing the GC log and the Solr log will let us look for problems.  The logs won't solve anything by themselves; they just let us examine the situation and see whether there is any evidence pointing to the root issue, and maybe to a solution.

If you're running with a heap that's too small, you can get into a situation where you never actually run out of memory, but the amount of available memory is so small that Java must continually run full garbage collections to keep enough of it free for the program to stay running.  This can happen to ANY Java program, including your ZK servers.

If that happens, the program itself will only be running a small percentage of the time, with extremely long pauses where very little happens other than garbage collection.  When the program starts running again, it realizes that its timeouts have been exceeded.  In SolrCloud, that will initiate recovery operations ... and those will probably keep the GC pause storm going.
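
If you want a quick first look at your own GC log before sending it, here is a rough Python sketch that estimates how much of the time the JVM was paused.  It assumes a Java 8 style log written with -XX:+PrintGCApplicationStoppedTime and -XX:+PrintGCTimeStamps, so every pause shows up as a "Total time for which application threads were stopped" line preceded by the JVM uptime.  The function name and the 15 second threshold are placeholders of mine; use whatever zkClientTimeout you are actually running with.  If your log format differs, the pattern will need adjusting.

```python
import re

# Assumed Java 8 log line (requires -XX:+PrintGCApplicationStoppedTime and
# -XX:+PrintGCTimeStamps), for example:
#   ...: 130.000: Total time for which application threads were stopped: 20.0000000 seconds, ...
STOPPED = re.compile(
    r"(?P<uptime>\d+\.\d+): Total time for which application threads "
    r"were stopped: (?P<pause>\d+\.\d+) seconds"
)


def summarize_pauses(lines, long_pause_secs=15.0):
    """Summarize stop-the-world pauses found in GC log lines.

    long_pause_secs is a placeholder threshold, not a Solr default you
    should rely on; set it to your own ZooKeeper client timeout.
    Returns None when no pause lines are found.
    """
    uptimes = []
    pauses = []
    for line in lines:
        m = STOPPED.search(line)
        if m:
            uptimes.append(float(m.group("uptime")))
            pauses.append(float(m.group("pause")))
    if not pauses:
        return None

    # Time covered by the log, from the first pause line to the last one.
    window = uptimes[-1] - uptimes[0]
    # The first pause ended at the start of the window, so leave it out of
    # the fraction. This is an approximation, not an exact accounting.
    paused_in_window = sum(pauses[1:])
    return {
        "pause_count": len(pauses),
        "total_paused_secs": sum(pauses),
        "longest_pause_secs": max(pauses),
        "long_pauses": sum(1 for p in pauses if p >= long_pause_secs),
        "paused_fraction": paused_in_window / window if window > 0 else None,
    }
```

To use it, open the GC log and pass the file object, e.g. summarize_pauses(open(path_to_gc_log)).  A paused_fraction anywhere near 1, or any long_pauses at all, would fit the pause storm described above.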

With an 8 GB heap and likely billions of documents being handled by one Solr instance, the low-memory situation I just described seems very possible.  The solution is to make the heap bigger.  Your Solr install is very large ... it seems unlikely to me that 8 GB would be enough.  Solr is not typically a memory hog when what it is asked to do is small, but a bigger job requires more memory.

Running without enough system memory to effectively cache the actively used indexes can also cause performance problems.  This means memory that is *NOT* allocated to programs like Solr, which the OS is free to use for disk caching.  On a busy enough server, performance problems caused by that can spiral and lead to SolrCloud recovery issues.
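
The arithmetic behind that is simple enough to sketch in Python.  Everything here is illustrative: the function name is made up, and the numbers in the comment are hypothetical, not your actual hardware or index sizes.  It also treats the heap size as the whole JVM footprint, which undercounts a little because a JVM uses some memory beyond its heap.

```python
def page_cache_headroom_gb(total_ram_gb, heap_sizes_gb, index_size_gb,
                           other_processes_gb=0.0):
    """Estimate the memory left for the OS disk cache on one server.

    Returns (cache_gb, coverage), where coverage is the fraction of the
    on-disk index that could fit in that cache. All inputs are rough
    estimates supplied by the caller.
    """
    # Whatever is not claimed by Java heaps or other programs is what the
    # OS can use to cache index files.
    cache_gb = total_ram_gb - sum(heap_sizes_gb) - other_processes_gb
    cache_gb = max(cache_gb, 0.0)
    coverage = cache_gb / index_size_gb if index_size_gb > 0 else None
    return cache_gb, coverage


# Hypothetical server: 64 GB RAM, one Solr instance with an 8 GB heap,
# and 200 GB of index data on disk.
cache_gb, coverage = page_cache_headroom_gb(64, [8], 200)
```

Note that raising the heap shrinks this number on the same hardware, so the two problems have to be weighed against each other.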

Thanks,
Shawn
