What kind of load are the machines under when this happens? A lot of
writes? A lot of http connections?

Do your zookeeper logs mention anything about losing clients?

Have you tried turning on GC logging or profiling GC?

Have you tried running with a smaller max heap size, or
setting -XX:CMSInitiatingOccupancyFraction?
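
Concretely, these are the sorts of flags I have in mind. This is only a sketch (the log path and the 70% threshold are placeholder values for illustration), not something tuned for your setup:

    # GC logging, so pause times can be lined up against the ZK timeouts
    -verbose:gc
    -Xloggc:/var/log/solr/gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime

    # CMS, starting old-gen collections earlier so it doesn't fall behind
    -XX:+UseConcMarkSweepGC
    -XX:+UseParNewGC
    -XX:CMSInitiatingOccupancyFraction=70
    -XX:+UseCMSInitiatingOccupancyOnly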

Just a shot in the dark, since I'm not familiar with Jetty's logging
statements, but that looks like plain old dropped HTTP sockets to me.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions>
g+: plus.google.com/appinions <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Mar 25, 2014 at 1:13 PM, Software Dev <static.void....@gmail.com> wrote:

> Can anyone else chime in? Thanks
>
> On Mon, Mar 24, 2014 at 10:10 AM, Software Dev
> <static.void....@gmail.com> wrote:
> > Shawn,
> >
> > Thanks for pointing me in the right direction. After consulting the
> > above document I *think* that the problem may be too large a heap,
> > which may be affecting garbage collection and hence causing ZK
> > timeouts.
> >
> > We have around 20G of memory on these machines with a min/max heap
> > of 6G and 10G respectively (-Xms6G -Xmx10G). The rest was set
> > aside for disk cache. Why did we choose 6-10? No other reason than we
> > wanted to allot enough for disk cache, and everything else was
> > thrown at Solr. Does this sound about right?
> >
> > I took some screenshots from VisualVM and our NewRelic reporting as
> > well as some relevant portions of our solrconfig.xml. Any
> > thoughts/comments would be greatly appreciated.
> >
> > http://postimg.org/gallery/4t73sdks/1fc10f9c/
> >
> > Thanks
> >
> >
> >
> >
> > On Sat, Mar 22, 2014 at 2:26 PM, Shawn Heisey <s...@elyograg.org> wrote:
> >> On 3/22/2014 1:23 PM, Software Dev wrote:
> >>> We have 2 collections with 1 shard each replicated over 5 servers in the
> >>> cluster. We see a lot of flapping (down or recovering) on one of the
> >>> collections. When this happens the other collection hosted on the same
> >>> machine is still marked as active, and it takes a fairly long time
> >>> (~30 minutes) for the affected collection to come back online, if at all.
> >>> I find that it's usually more reliable to completely shut down Solr on the
> >>> affected machine and bring it back up with its core disabled. We then
> >>> re-enable the core when it's marked as active.
> >>>
> >>> A few questions:
> >>>
> >>> 1) What is the healthcheck in SolrCloud? Put another way, what is failing
> >>> that marks one collection as down but the other on the same machine as up?
> >>>
> >>> 2) Why does recovery take forever when a node goes down, even if it's only
> >>> down for 30 seconds? Our index is only 7-8G and we are running on SSDs.
> >>>
> >>> 3) What can be done to diagnose and fix this problem?
> >>
> >> Unless you are actually using the ping request handler, the healthcheck
> >> config will not matter.  Or were you referring to something else?
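
For reference, the "healthcheck config" mentioned here is the ping request handler in solrconfig.xml. A minimal sketch, close to the stock Solr 4.x example config, and only relevant if a load balancer actually hits /admin/ping:

    <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
      <lst name="invariants">
        <str name="q">solrpingquery</str>
      </lst>
      <lst name="defaults">
        <str name="echoParams">all</str>
      </lst>
      <!-- optional: lets you enable/disable ping responses via a file -->
      <str name="healthcheckFile">server-enabled.txt</str>
    </requestHandler>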
> >>
> >> Referencing the logs you included in your reply:  The EofException
> >> errors happen because your client code times out and disconnects before
> >> the request it made has completed.  That is most likely just a symptom
> >> that has nothing at all to do with the problem.
> >>
> >> Read the following wiki page.  What I'm going to say below will
> >> reference information you can find there:
> >>
> >> http://wiki.apache.org/solr/SolrPerformanceProblems
> >>
> >> Relevant side note: The default zookeeper client timeout is 15 seconds.
> >>  A typical zookeeper config defines tickTime as 2 seconds, and the
> >> timeout cannot be configured to be more than 20 times the tickTime,
> >> which means it cannot go beyond 40 seconds.  The default timeout value
> >> of 15 seconds is usually more than enough, unless you are having
> >> performance problems.
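
To make those numbers concrete, here is a rough sketch of where the two settings live; the values shown are just the defaults described above, not recommendations:

    # zoo.cfg (ZooKeeper side); session timeouts are capped at 20 * tickTime
    tickTime=2000

    <!-- solr.xml (Solr 4.x style); the ZK client session timeout -->
    <solrcloud>
      <int name="zkClientTimeout">15000</int>
    </solrcloud>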
> >>
> >> If you are not actually taking Solr instances down, then the fact that
> >> you are seeing the log replay messages indicates to me that something is
> >> taking so much time that the connection to Zookeeper times out.  When it
> >> finally responds, it will attempt to recover the index, which means
> >> first it will replay the transaction log and then it might replicate the
> >> index from the shard leader.
> >>
> >> Replaying the transaction log is likely the reason it takes so long to
> >> recover.  The wiki page I linked above has a "slow startup" section that
> >> explains how to fix this.
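
The fix described there amounts to a hard autoCommit with openSearcher=false, which keeps the transaction log small so there is little to replay on recovery. A rough sketch of what that looks like in solrconfig.xml (the 15000 ms interval is just an example value):

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxTime>15000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
    </updateHandler>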
> >>
> >> There is some kind of underlying problem that is causing the zookeeper
> >> connection to timeout.  It is most likely garbage collection pauses or
> >> insufficient RAM to cache the index, possibly both.
> >>
> >> You did not indicate how much total RAM you have or how big your Java
> >> heap is.  As the wiki page mentions in the SSD section, SSD is not a
> >> substitute for having enough RAM to cache a significant percentage of
> >> your index.
> >>
> >> Thanks,
> >> Shawn
> >>
>
