Can anyone else chime in? Thanks

On Mon, Mar 24, 2014 at 10:10 AM, Software Dev
<static.void....@gmail.com> wrote:
> Shawn,
>
> Thanks for pointing me in the right direction. After consulting the
> above document I *think* the problem may be an oversized heap, which
> could be causing long garbage collection pauses and hence the ZK
> timeouts.
>
> We have around 20G of memory on these machines with a min/max heap of
> 6G and 10G respectively (-Xms6G -Xmx10G). The rest was set aside for
> disk cache. Why did we choose 6-10? No reason other than we wanted to
> allot enough for disk cache, and everything else was thrown at Solr.
> Does this sound about right?
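>
> For concreteness, here are the flags in question, plus a sketch of GC
> settings we are considering (the fixed 6G heap and the CMS flags are
> untested assumptions, not what we currently run):
>
>   # current: heap can grow from 6G to 10G
>   -Xms6G -Xmx10G
>
>   # sketch: a fixed heap avoids resize stalls; CMS trades some
>   # throughput for shorter stop-the-world pauses (standard HotSpot
>   # flags)
>   -Xms6G -Xmx6G
>   -XX:+UseConcMarkSweepGC
>   -XX:+UseParNewGC
>   -XX:CMSInitiatingOccupancyFraction=75
>   -XX:+UseCMSInitiatingOccupancyOnly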
>
> I took some screenshots from VisualVM and our NewRelic reporting, as
> well as of some relevant portions of our solrconfig.xml. Any
> thoughts/comments would be greatly appreciated.
>
> http://postimg.org/gallery/4t73sdks/1fc10f9c/
>
> Thanks
>
>
>
>
> On Sat, Mar 22, 2014 at 2:26 PM, Shawn Heisey <s...@elyograg.org> wrote:
>> On 3/22/2014 1:23 PM, Software Dev wrote:
>>> We have 2 collections with 1 shard each, replicated over 5 servers in
>>> the cluster. We see a lot of flapping (down or recovering) on one of
>>> the collections. When this happens, the other collection hosted on the
>>> same machine is still marked as active, and it takes a fairly long
>>> time (~30 minutes) for the affected collection to come back online, if
>>> at all. I find that it's usually more reliable to completely shut down
>>> Solr on the affected machine and bring it back up with its core
>>> disabled. We then re-enable the core once it's marked as active.
>>>
>>> A few questions:
>>>
>>> 1) What is the healthcheck in SolrCloud? Put another way, what is
>>> failing that marks one collection as down but the other on the same
>>> machine as up?
>>>
>>> 2) Why does recovery take forever when a node goes down, even if it's
>>> only down for 30 seconds? Our index is only 7-8G and we are running on
>>> SSDs.
>>>
>>> 3) What can be done to diagnose and fix this problem?
>>
>> Unless you are actually using the ping request handler, the healthcheck
>> config will not matter.  Or were you referring to something else?
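>>
>> For reference, "healthcheck" in that sense is just the ping handler
>> defined in solrconfig.xml, something like the following sketch (the
>> query string is arbitrary, just an example):
>>
>>   <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
>>     <lst name="invariants">
>>       <str name="q">solrpingquery</str>
>>     </lst>
>>   </requestHandler>
>>
>> SolrCloud itself does not consult that handler; a replica is typically
>> marked down when its ZooKeeper session times out, which is what the
>> rest of this message is about.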
>>
>> Referencing the logs you included in your reply:  The EofException
>> errors happen because your client code times out and disconnects before
>> the request it made has completed.  That is most likely just a symptom
>> that has nothing at all to do with the problem.
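>>
>> If you want to rule the client out, you can raise its timeouts.  A
>> sketch with SolrJ 4.x (the URL and values are placeholders):
>>
>>   HttpSolrServer server =
>>       new HttpSolrServer("http://localhost:8983/solr/collection1");
>>   server.setConnectionTimeout(5000); // ms to establish the connection
>>   server.setSoTimeout(120000);       // ms to wait for a response
>>
>> But again, the disconnects are probably a symptom, not the cause.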
>>
>> Read the following wiki page.  What I'm going to say below will
>> reference information you can find there:
>>
>> http://wiki.apache.org/solr/SolrPerformanceProblems
>>
>> Relevant side note: The default ZooKeeper client timeout is 15 seconds.
>> A typical ZooKeeper config defines tickTime as 2 seconds, and the
>> timeout cannot be configured to be more than 20 times the tickTime,
>> which means it cannot go beyond 40 seconds.  The default timeout of 15
>> seconds is usually more than enough, unless you are having performance
>> problems.
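>>
>> In zoo.cfg terms, that limit looks like this (typical values, shown
>> as an illustration rather than your actual config):
>>
>>   tickTime=2000
>>   # session timeouts are clamped to 2*tickTime .. 20*tickTime,
>>   # i.e. 4000-40000 ms with the value above
>>
>> and the Solr side of the negotiation is the zkClientTimeout setting
>> in solr.xml (or the -DzkClientTimeout system property if your
>> solr.xml references it).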
>>
>> If you are not actually taking Solr instances down, then the fact that
>> you are seeing the log replay messages indicates to me that something is
>> taking so much time that the connection to ZooKeeper times out.  When
>> the node finally reconnects, it will attempt to recover the index, which
>> means it will first replay the transaction log and then it might
>> replicate the index from the shard leader.
>>
>> Replaying the transaction log is likely the reason it takes so long to
>> recover.  The wiki page I linked above has a "slow startup" section that
>> explains how to fix this.
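>>
>> The usual fix from that "slow startup" section is to enable a hard
>> autoCommit with openSearcher=false, so the transaction log gets
>> rotated and stays small.  A sketch for solrconfig.xml (the 15-second
>> interval is a common starting point, not a tested recommendation):
>>
>>   <updateHandler class="solr.DirectUpdateHandler2">
>>     <autoCommit>
>>       <maxTime>15000</maxTime>
>>       <openSearcher>false</openSearcher>
>>     </autoCommit>
>>   </updateHandler>
>>
>> With openSearcher=false this flushes segments without opening a new
>> searcher or invalidating caches, so it is cheap to run frequently.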
>>
>> There is some kind of underlying problem that is causing the ZooKeeper
>> connection to time out.  It is most likely garbage collection pauses or
>> insufficient RAM to cache the index, possibly both.
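>>
>> Turning on GC logging is the quickest way to confirm or rule out the
>> pause theory.  A sketch with standard HotSpot flags (the log path is
>> a placeholder):
>>
>>   -verbose:gc
>>   -Xloggc:/var/log/solr/gc.log
>>   -XX:+PrintGCDetails
>>   -XX:+PrintGCDateStamps
>>   -XX:+PrintGCApplicationStoppedTime
>>
>> Any "stopped" interval longer than your zkClientTimeout will line up
>> with a session expiry and a recovery cycle.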
>>
>> You did not indicate how much total RAM you have or how big your Java
>> heap is.  As the wiki page mentions in the SSD section, SSD is not a
>> substitute for having enough RAM to cache a significant percentage of
>> your index.
>>
>> Thanks,
>> Shawn
>>
