Can anyone else chime in? Thanks
On Mon, Mar 24, 2014 at 10:10 AM, Software Dev <static.void....@gmail.com> wrote:
> Shawn,
>
> Thanks for pointing me in the right direction. After consulting the
> above document I *think* the problem may be too large a heap, which
> may be affecting garbage collection and hence causing ZK timeouts.
>
> We have around 20G of memory on these machines, with heap min/max at
> 6G and 10G respectively (-Xms6G -Xmx10G). The rest was set aside for
> disk cache. Why did we choose 6-10? No reason other than that we
> wanted to allot enough for disk cache, and everything else was thrown
> at Solr. Does this sound about right?
>
> I took some screenshots from VisualVM and our NewRelic reporting, as
> well as some relevant portions of our solrconfig.xml. Any
> thoughts/comments would be greatly appreciated.
>
> http://postimg.org/gallery/4t73sdks/1fc10f9c/
>
> Thanks
>
> On Sat, Mar 22, 2014 at 2:26 PM, Shawn Heisey <s...@elyograg.org> wrote:
>> On 3/22/2014 1:23 PM, Software Dev wrote:
>>> We have 2 collections with 1 shard each, replicated over 5 servers
>>> in the cluster. We see a lot of flapping (down or recovering) on one
>>> of the collections. When this happens, the other collection hosted
>>> on the same machine is still marked as active, and it takes a fairly
>>> long time (~30 minutes) for the affected collection to come back
>>> online, if at all. I find it's usually more reliable to completely
>>> shut down Solr on the affected machine and bring it back up with its
>>> core disabled. We then re-enable the core once it's marked as
>>> active.
>>>
>>> A few questions:
>>>
>>> 1) What is the healthcheck in SolrCloud? Put another way, what is
>>> failing that marks one collection as down but the other on the same
>>> machine as up?
>>>
>>> 2) Why does recovery take forever when a node goes down, even if
>>> it's only down for 30 seconds? Our index is only 7-8G and we are
>>> running on SSDs.
>>>
>>> 3) What can be done to diagnose and fix this problem?
>>
>> Unless you are actually using the ping request handler, the
>> healthcheck config will not matter. Or were you referring to
>> something else?
>>
>> Referencing the logs you included in your reply: the EofException
>> errors happen because your client code times out and disconnects
>> before the request it made has completed. That is most likely just a
>> symptom and has nothing at all to do with the underlying problem.
>>
>> Read the following wiki page. What I say below references
>> information you can find there:
>>
>> http://wiki.apache.org/solr/SolrPerformanceProblems
>>
>> Relevant side note: the default zookeeper client timeout is 15
>> seconds. A typical zookeeper config defines tickTime as 2 seconds,
>> and the timeout cannot be configured to be more than 20 times the
>> tickTime, which means it cannot go beyond 40 seconds. The default of
>> 15 seconds is usually more than enough, unless you are having
>> performance problems.
>>
>> If you are not actually taking Solr instances down, then the log
>> replay messages you are seeing indicate that something is taking so
>> long that the connection to Zookeeper times out. When the instance
>> finally responds, it will attempt to recover the index, which means
>> it will first replay the transaction log and then possibly replicate
>> the index from the shard leader.
>>
>> Replaying the transaction log is likely the reason it takes so long
>> to recover. The wiki page I linked above has a "slow startup"
>> section that explains how to fix this.
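The fix that wiki section describes amounts to keeping the transaction
log small with frequent hard commits, so there is little to replay on
recovery. A minimal solrconfig.xml sketch, assuming the commonly
suggested 15-second interval (tune maxTime for your write load):

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- the transaction log that gets replayed during recovery -->
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
      </updateLog>
      <!-- hard commit frequently so uncommitted data, and therefore
           replay time, stays bounded; openSearcher=false keeps these
           commits cheap and invisible to queries -->
      <autoCommit>
        <maxTime>15000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
    </updateHandler>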
>> There is some kind of underlying problem that is causing the
>> zookeeper connection to time out. It is most likely garbage
>> collection pauses or insufficient RAM to cache the index, possibly
>> both.
>>
>> You did not indicate how much total RAM you have or how big your
>> Java heap is. As the wiki page mentions in the SSD section, SSD is
>> not a substitute for having enough RAM to cache a significant
>> percentage of your index.
>>
>> Thanks,
>> Shawn
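To check the GC-pause theory before changing heap sizes, it can help
to turn on GC logging and look for stop-the-world pauses approaching
the 15-second zookeeper client timeout. A sketch of the startup flags
(Java 7-era options; the log path is an assumption):

    # append to the JVM options Solr is started with
    JAVA_OPTS="$JAVA_OPTS \
      -verbose:gc \
      -XX:+PrintGCDetails \
      -XX:+PrintGCDateStamps \
      -XX:+PrintGCApplicationStoppedTime \
      -Xloggc:/var/log/solr/gc.log"

Any "Total time for which application threads were stopped" entry in
that log anywhere near 15 seconds would explain the ZK session
expirations and the flapping.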