What kind of load are the machines under when this happens? A lot of writes? A lot of HTTP connections?

Do your ZooKeeper logs mention anything about losing clients? Have you tried turning on GC logging or profiling GC? Have you tried running with a smaller max heap size, or setting -XX:CMSInitiatingOccupancyFraction?
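If GC turns out to be the culprit, flags along these lines will give you a log you can dig through. A minimal sketch, assuming the CMS collector; the log path and the 70 are placeholders to adjust for your setup:

    -verbose:gc
    -Xloggc:/path/to/gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime
    -XX:+UseConcMarkSweepGC
    -XX:+UseParNewGC
    -XX:CMSInitiatingOccupancyFraction=70
    -XX:+UseCMSInitiatingOccupancyOnly

Note that without -XX:+UseCMSInitiatingOccupancyOnly, the JVM treats the occupancy fraction as a hint rather than a hard rule.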
Just a shot in the dark, since I'm not familiar with Jetty's logging statements, but that looks like plain old dropped HTTP sockets to me.

Michael Della Bitta
Applications Developer
o: +1 646 532 3062

appinions inc.
"The Science of Influence Marketing"
18 East 41st Street, New York, NY 10017
w: appinions.com | t: @appinions

On Tue, Mar 25, 2014 at 1:13 PM, Software Dev <static.void....@gmail.com> wrote:
> Can anyone else chime in? Thanks
>
> On Mon, Mar 24, 2014 at 10:10 AM, Software Dev
> <static.void....@gmail.com> wrote:
> > Shawn,
> >
> > Thanks for pointing me in the right direction. After consulting the
> > above document I *think* the problem may be too large a heap, which
> > may be affecting GC and hence causing ZK timeouts.
> >
> > We have around 20G of memory on these machines, with the heap
> > min/max set to 6G and 10G respectively (-Xms6G -Xmx10G). The rest
> > was set aside for disk cache. Why did we choose 6-10? No reason
> > other than we wanted to allot enough for disk cache; everything
> > else was thrown at Solr. Does this sound about right?
> >
> > I took some screenshots from VisualVM and our NewRelic reporting,
> > as well as some relevant portions of our solrconfig.xml. Any
> > thoughts/comments would be greatly appreciated.
> >
> > http://postimg.org/gallery/4t73sdks/1fc10f9c/
> >
> > Thanks
> >
> > On Sat, Mar 22, 2014 at 2:26 PM, Shawn Heisey <s...@elyograg.org> wrote:
> >> On 3/22/2014 1:23 PM, Software Dev wrote:
> >>> We have 2 collections with 1 shard each, replicated over 5
> >>> servers in the cluster. We see a lot of flapping (down or
> >>> recovering) on one of the collections. When this happens, the
> >>> other collection hosted on the same machine is still marked as
> >>> active, and it takes a fairly long time (~30 minutes) for the
> >>> affected collection to come back online, if at all. I find it's
> >>> usually more reliable to completely shut down Solr on the
> >>> affected machine and bring it back up with its core disabled,
> >>> then re-enable the core once it's marked as active.
> >>>
> >>> A few questions:
> >>>
> >>> 1) What is the healthcheck in SolrCloud? Put another way, what is
> >>> failing that marks one collection as down but the other on the
> >>> same machine as up?
> >>>
> >>> 2) Why does recovery take forever when a node goes down, even if
> >>> it's only down for 30 seconds? Our index is only 7-8G and we are
> >>> running on SSDs.
> >>>
> >>> 3) What can be done to diagnose and fix this problem?
> >>
> >> Unless you are actually using the ping request handler, the
> >> healthcheck config will not matter. Or were you referring to
> >> something else?
> >>
> >> Referencing the logs you included in your reply: the EofException
> >> errors happen because your client code times out and disconnects
> >> before the request it made has completed. That is most likely just
> >> a symptom and has nothing at all to do with the underlying problem.
> >>
> >> Read the following wiki page. What I'm going to say below
> >> references information you can find there:
> >>
> >> http://wiki.apache.org/solr/SolrPerformanceProblems
> >>
> >> Relevant side note: the default ZooKeeper client timeout is 15
> >> seconds. A typical ZooKeeper config defines tickTime as 2 seconds,
> >> and the timeout cannot be configured to be more than 20 times the
> >> tickTime, which means it cannot go beyond 40 seconds. The default
> >> of 15 seconds is usually more than enough, unless you are having
> >> performance problems.
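> >> As an illustration, the two knobs involved would look something
> >> like this; 2000 and 30000 are typical values, not a prescription
> >> for your cluster:
> >>
> >>     # zoo.cfg on the ZooKeeper nodes: 20 x 2000ms = 40s ceiling
> >>     tickTime=2000
> >>
> >>     <!-- solr.xml; can also be passed as -DzkClientTimeout=30000 -->
> >>     <solrcloud>
> >>       <int name="zkClientTimeout">30000</int>
> >>     </solrcloud>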
> >> If you are not actually taking Solr instances down, then the fact
> >> that you are seeing log replay messages indicates to me that
> >> something is taking so much time that the connection to ZooKeeper
> >> times out. When Solr finally reconnects, it will attempt to
> >> recover the index, which means first replaying the transaction
> >> log and then possibly replicating the index from the shard leader.
> >>
> >> Replaying the transaction log is likely the reason it takes so
> >> long to recover. The wiki page I linked above has a "slow startup"
> >> section that explains how to fix this.
> >>
> >> There is some kind of underlying problem causing the ZooKeeper
> >> connection to time out. It is most likely garbage collection
> >> pauses or insufficient RAM to cache the index, possibly both.
> >>
> >> You did not indicate how much total RAM you have or how big your
> >> Java heap is. As the wiki page mentions in the SSD section, SSD is
> >> not a substitute for having enough RAM to cache a significant
> >> percentage of your index.
> >>
> >> Thanks,
> >> Shawn
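> >> P.S. The "slow startup" fix boils down to a hard autoCommit with
> >> openSearcher=false, so the transaction log stays small. A minimal
> >> sketch for solrconfig.xml; the 15 second maxTime is a ballpark to
> >> tune against your indexing rate:
> >>
> >>     <updateHandler class="solr.DirectUpdateHandler2">
> >>       <autoCommit>
> >>         <maxTime>15000</maxTime>
> >>         <openSearcher>false</openSearcher>
> >>       </autoCommit>
> >>     </updateHandler>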