On 3/22/2014 1:23 PM, Software Dev wrote:
> We have 2 collections with 1 shard each, replicated over 5 servers in the
> cluster. We see a lot of flapping (down or recovering) on one of the
> collections. When this happens, the other collection hosted on the same
> machine is still marked as active. It takes a fairly long time (~30
> minutes) for the affected collection to come back online, if at all. I
> find that it's usually more reliable to completely shut down Solr on the
> affected machine and bring it back up with its core disabled. We then
> re-enable the core once it's marked as active.
>
> A few questions:
>
> 1) What is the healthcheck in SolrCloud? Put another way, what is failing
> that marks one collection as down but the other on the same machine as up?
>
> 2) Why does recovery take forever when a node goes down, even if it's only
> down for 30 seconds? Our index is only 7-8GB and we are running on SSDs.
>
> 3) What can be done to diagnose and fix this problem?
Unless you are actually using the ping request handler, the healthcheck config will not matter. Or were you referring to something else?

Regarding the logs you included in your reply: the EofException errors happen because your client code times out and disconnects before the request it made has completed. That is most likely just a symptom that has nothing at all to do with the underlying problem.

Read the following wiki page. What I'm going to say below will reference information you can find there:

http://wiki.apache.org/solr/SolrPerformanceProblems

Relevant side note: the default ZooKeeper client timeout is 15 seconds. A typical ZooKeeper config defines tickTime as 2 seconds, and the client timeout cannot be configured to be more than 20 times the tickTime, which means it cannot go beyond 40 seconds. The default of 15 seconds is usually more than enough, unless you are having performance problems.

If you are not actually taking Solr instances down, then the log replay messages you are seeing tell me that something is taking so long that the connection to ZooKeeper times out. When the node finally responds, it will attempt to recover the index, which means it will first replay the transaction log and then possibly replicate the index from the shard leader. Replaying the transaction log is likely the reason recovery takes so long. The wiki page linked above has a "slow startup" section that explains how to fix this.

There is some kind of underlying problem that is causing the ZooKeeper connection to time out. It is most likely garbage collection pauses or insufficient RAM to cache the index, possibly both. You did not indicate how much total RAM you have or how big your Java heap is. As the wiki page mentions in the SSD section, SSD is not a substitute for having enough RAM to cache a significant percentage of your index.
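To make the timeout numbers above concrete, here is a rough sketch of where those settings live. The values are only illustrative defaults, not taken from your setup:

  # zoo.cfg (ZooKeeper server)
  tickTime=2000
  # ZooKeeper caps the negotiated session timeout at 20 * tickTime
  # (maxSessionTimeout), which is where the 40 second ceiling comes from.

  <!-- solr.xml (Solr side, assuming the newer-style solr.xml with a
       solrcloud section) -- client timeout in milliseconds -->
  <solrcloud>
    <int name="zkClientTimeout">15000</int>
  </solrcloud>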
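For the transaction log replay, the usual fix from that "slow startup" section is to make sure a hard autoCommit with openSearcher=false is configured, so the tlog gets rotated regularly and stays small. A rough sketch of what that looks like in solrconfig.xml -- the maxTime value here is just an example, tune it for your indexing pattern:

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <autoCommit>
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>

With openSearcher set to false, these commits only flush and rotate the transaction log; they do not change document visibility, so you still control that with soft commits or explicit commits from your indexing code.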
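To find out whether garbage collection is involved, turn on GC logging and look for long pauses around the times the node drops out. A sketch of the kind of startup options I mean -- the heap size and log path are made up, adjust them for your install:

  java -Xms4g -Xmx4g \
       -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
       -Xloggc:/var/solr/logs/gc.log \
       -jar start.jar

Any single pause in that log that approaches the zkClientTimeout value is enough for ZooKeeper to decide the node is gone. Also compare your Java heap and your 7-8GB index against the total RAM on the box -- whatever is left over after the heap is all the OS has available for caching the index.

Thanks,
Shawn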