Reading the ZK transaction log could be an issue, as ZK seems to be sensitive to this ( https://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html#The+Log+Directory )
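For reference, transaction-log placement is controlled by the dataDir and dataLogDir settings in zoo.cfg; pointing dataLogDir at a dedicated device is what the linked guide recommends. A minimal sketch (the paths here are hypothetical examples, not our actual layout):

```properties
# zoo.cfg -- example paths only
dataDir=/var/zookeeper/data            # snapshots and, by default, the txn log
dataLogDir=/mnt/zk-txlog               # move the transaction log to its own device
```

When dataLogDir is unset, the transaction log shares a device with the snapshots, which is exactly the "busy device" case the guide warns about.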
> incorrect placement of transaction log
>
> The most performance critical part of ZooKeeper is the transaction log.
> ZooKeeper syncs transactions to media before it returns a response. A
> dedicated transaction log device is key to consistent good performance.
> Putting the log on a busy device will adversely affect performance. If you
> only have one storage device, put trace files on NFS and increase the
> snapshotCount; it doesn't eliminate the problem, but it should mitigate it.

I am not sure the Solr logs and GC logs were evident from my previous
mail. Re-posting them here for your reference:

Here is the full Solr log file (note that it is in INFO mode):
https://raw.githubusercontent.com/ganeshmailbox/har/master/SolrLogFile

Here is the GC log:
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMTAvMy8tLTAxX3NvbHJfZ2MubG9nLjUtLTIxLTE5LTU3

Thanks,
Ganesh

On Fri, Oct 5, 2018 at 10:13 AM Shawn Heisey <apa...@elyograg.org> wrote:

> On 10/5/2018 5:15 AM, Ganesh Sethuraman wrote:
> > 1. Do the GC and Solr logs help explain why the Solr replica server
> > continues to be in the recovering state? Our assumption is that the ZK
> > transaction log read we did on Sept 17 at 16:00 hrs might have caused
> > the issue. Is that correct?
> > 2. Can this state cause slowness in Solr read queries?
> > 3. Is there any way to get notified by email if any replica on the
> > servers gets into recovery mode?
>
> Seeing the GC log and Solr log will allow us to look for problems. It
> won't solve anything by itself; it just lets us examine the situation and
> see if there is any evidence pointing to the root issue and maybe a
> solution.
>
> If you're running with a heap that's too small, you can get into a
> situation where you never actually run out of memory, but the amount of
> available memory is so small that Java must continually run full garbage
> collections to keep enough of it free for the program to stay running.
> This can happen to ANY Java program, including your ZK servers.
> If that happens, the program itself will only be running a small
> percentage of the time, and there will be extremely long pauses where
> very little happens other than garbage collection. Then, when the
> program starts running again, it realizes that its timeouts have been
> exceeded, which in SolrCloud will initiate recovery operations ... and
> that will probably keep the GC pause storm happening.
>
> With an 8 GB heap and likely billions of documents being handled by one
> Solr instance, that low-memory situation I just described seems very
> possible. The solution is to make the heap bigger. Your Solr install
> is very large ... it seems unlikely to me that 8GB would be enough.
> Solr is not typically a memory-hog kind of application when what it is
> asked to do is small; when it is asked to do a bigger job, more memory
> will be required.
>
> Running without sufficient system memory to effectively cache the
> indexes that are actively used can also cause performance problems.
> This is memory *NOT* allocated to programs like Solr, which the OS is
> free to use for caching purposes. With a busy enough server,
> performance problems caused by that can spiral and lead to SolrCloud
> recovery issues.
>
> Thanks,
> Shawn
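On question 3 above (getting notified when a replica enters recovery), one approach is to poll the Collections API CLUSTERSTATUS endpoint and flag any replica whose state is not "active". A minimal sketch in Python, using only the standard library; the base URL is an assumption, and wiring the result into email or another alerting channel is left out:

```python
import json
from urllib.request import urlopen

# Assumed base URL for one node of the cluster -- adjust for your setup.
SOLR_URL = "http://localhost:8983/solr"


def recovering_replicas(cluster_status):
    """Return (collection, shard, replica) tuples for every replica whose
    state is not 'active' in a parsed CLUSTERSTATUS response."""
    flagged = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                if replica.get("state") != "active":
                    flagged.append((coll_name, shard_name, replica_name))
    return flagged


def poll():
    """Fetch CLUSTERSTATUS from a live cluster and report non-active replicas."""
    url = SOLR_URL + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urlopen(url) as resp:
        return recovering_replicas(json.load(resp))
```

Running poll() from cron and emailing whenever it returns a non-empty list would give a basic recovery alert; dedicated monitoring tools can do the same with less plumbing.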