On 3/22/2014 1:23 PM, Software Dev wrote:
> We have 2 collections with 1 shard each, replicated over 5 servers in the
> cluster. We see a lot of flapping (down or recovering) on one of the
> collections. When this happens, the other collection hosted on the same
> machine is still marked as active. It takes a fairly long time (~30
> minutes) for the affected collection to come back online, if at all. I
> find that it's usually more reliable to completely shut down Solr on the
> affected machine and bring it back up with its core disabled. We then
> re-enable the core once it's marked as active.
>
> A few questions:
>
> 1) What is the healthcheck in SolrCloud? Put another way, what is failing
> that marks one collection as down but the other on the same machine as up?
>
> 2) Why does recovery take forever when a node goes down, even if it's only
> down for 30 seconds? Our index is only 7-8GB and we are running on SSDs.
>
> 3) What can be done to diagnose and fix this problem?
Unless you are actually using the ping request handler, the healthcheck config will not matter. Or were you referring to something else?

Regarding the logs you included in your reply: the EofException errors happen because your client code times out and disconnects before the request it made has completed. That is most likely just a symptom that has nothing at all to do with the underlying problem.

Read the following wiki page. What I'm going to say below will reference information you can find there:

http://wiki.apache.org/solr/SolrPerformanceProblems

Relevant side note: the default ZooKeeper client timeout is 15 seconds. A typical ZooKeeper config defines tickTime as 2 seconds, and the client timeout cannot be configured to be more than 20 times the tickTime, which means it cannot go beyond 40 seconds. The default of 15 seconds is usually more than enough, unless you are having performance problems.

If you are not actually taking Solr instances down, then the log replay messages you are seeing tell me that something is taking so long that the connection to ZooKeeper times out. When the node finally responds, it will attempt to recover the index, which means it will first replay the transaction log and then possibly replicate the index from the shard leader. Replaying the transaction log is likely the reason recovery takes so long. The wiki page linked above has a "slow startup" section that explains how to fix this.

There is some kind of underlying problem that is causing the ZooKeeper connection to time out. It is most likely garbage collection pauses or insufficient RAM to cache the index, possibly both. You did not indicate how much total RAM you have or how big your Java heap is. As the wiki page mentions in the SSD section, SSD is not a substitute for having enough RAM to cache a significant percentage of your index.
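To make the timeout numbers above concrete, here is a rough sketch of where those settings live. The values are only illustrative defaults, not taken from your setup:

  # zoo.cfg (ZooKeeper server)
  tickTime=2000
  # ZooKeeper caps the negotiated session timeout at 20 * tickTime
  # (maxSessionTimeout), which is where the 40 second ceiling comes from.

  <!-- solr.xml (Solr side, assuming the newer-style solr.xml with a
       solrcloud section) -- client timeout in milliseconds -->
  <solrcloud>
    <int name="zkClientTimeout">15000</int>
  </solrcloud>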
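For the transaction log replay, the usual fix from that "slow startup" section is to make sure a hard autoCommit with openSearcher=false is configured, so the tlog gets rotated regularly and stays small. A rough sketch of what that looks like in solrconfig.xml -- the maxTime value here is just an example, tune it for your indexing pattern:

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <autoCommit>
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>

With openSearcher set to false, these commits only flush and rotate the transaction log; they do not change document visibility, so you still control that with soft commits or explicit commits from your indexing code.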
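To find out whether garbage collection is involved, turn on GC logging and look for long pauses around the times the node drops out. A sketch of the kind of startup options I mean -- the heap size and log path are made up, adjust them for your install:

  java -Xms4g -Xmx4g \
       -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
       -Xloggc:/var/solr/logs/gc.log \
       -jar start.jar

Any single pause in that log that approaches the zkClientTimeout value is enough for ZooKeeper to decide the node is gone. Also compare your Java heap and your 7-8GB index against the total RAM on the box -- whatever is left over after the heap is all the OS has available for caching the index.

Thanks,
Shawn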