Last night the problem occurred again, and I have more data. This time it
happened on only one Solr server, and that server successfully recovered.
The Solr that had all the leaders:

[06:38:58.205 - 06:38:58.222] Stopping recovery for
zkNodeName=core_node2core=****** (for all collections)

[06:38:59.322 - 06:38:59.995] ElectionContext: cancelElection did not find
election node to remove (many times)

[06:39:02.403] PeerSync: no frame of reference to tell if we've missed
updates (after this, Solr recovered)

One of the zookeepers (the one on the same server as the Solr that got the
warnings):

[06:38:58.099, 06:38:58.113, 06:38:58.114]
org.apache.zookeeper.server.NIOServerCnxn: [ERROR] Unexpected Exception:
java.nio.channels.CancelledKeyException (3 times)

The other Solr and the other zookeepers didn't log any errors or warnings.

Some monitoring data:

Garbage Collectors Summary: https://apps.sematext.com/spm-reports/s/RYZxbcHXzu
Pool Size: https://apps.sematext.com/spm-reports/s/N5c8QFc86d
Pool Utilization: https://apps.sematext.com/spm-reports/s/B487KaWGXP
Load: https://apps.sematext.com/spm-reports/s/ytfFzqYBl2

2014-03-24 17:39 GMT+02:00 Lukas Mikuckis <lukasmikuc...@gmail.com>:

> We tried setting the ZK timeout to 1s and did load testing (both indexing
> and search), and this issue didn't happen.
>
> 2014-03-24 17:00 GMT+02:00 Lukas Mikuckis <lukasmikuc...@gmail.com>:
>
>> Garbage Collectors Summary:
>> https://apps.sematext.com/spm-reports/s/rgRnwuShgI
>>
>> Pool Size:
>> https://apps.sematext.com/spm-reports/s/H16ndqichM
>>
>> First Stopping recovery warning: 4:00, OOM error: 6:30.
>>
>> 2014-03-24 16:35 GMT+02:00 Shalin Shekhar Mangar <shalinman...@gmail.com>:
>>
>>> I am guessing that it is all related to memory issues. I guess that as
>>> the used heap increases, full GC cycles increase, causing ZK timeouts,
>>> which in turn cause more recoveries to be initiated. In the end,
>>> everything blows up with the out-of-memory errors. Do you log GC
>>> activity on your servers?
>>>
>>> I suggest that you roll back to 4.6.1 for now and upgrade to 4.7.1 when
>>> it is released next week.
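(A side note on the GC-logging question: in case it isn't enabled yet, a
minimal sketch of the standard HotSpot flags is below. This assumes an
Oracle/Sun JDK 6 or 7 and the stock Jetty start.jar that ships with Solr
4.x; the heap size and log path are just examples.

  # log basic GC events with per-generation detail, wall-clock
  # timestamps, and stop-the-world pause lengths to a dedicated file:
  java -Xmx4g \
       -verbose:gc \
       -XX:+PrintGCDetails \
       -XX:+PrintGCDateStamps \
       -XX:+PrintGCApplicationStoppedTime \
       -Xloggc:/var/log/solr/gc.log \
       -jar start.jar

Stop-the-world pauses in that log longer than the ZK session timeout would
directly support the GC-causes-ZK-timeouts theory above.)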
>>> On Mon, Mar 24, 2014 at 7:51 PM, Lukas Mikuckis <lukasmikuc...@gmail.com> wrote:
>>> > Yes, we upgraded Solr from 4.6.1 to 4.7 three weeks ago (two weeks
>>> > before Solr started crashing). When we upgraded, we just upgraded Solr
>>> > itself and changed the versions in the collection configs.
>>> >
>>> > When Solr crashes we get an OOM, but only 2h after the first Stopping
>>> > recovery warnings.
>>> >
>>> > Do you have any idea when Stopping recovery warnings are thrown? Right
>>> > now we have no idea what could be causing this issue.
>>> >
>>> > Mon, 24 Mar 2014 04:03:17 GMT Shalin Shekhar Mangar <shalinman...@gmail.com>:
>>> >>
>>> >> Did you upgrade recently to Solr 4.7? 4.7 has a bad bug which can
>>> >> cause out of memory issues. Can you check your logs for out of memory
>>> >> errors?
>>> >>
>>> >> On Sun, Mar 23, 2014 at 9:07 PM, Lukas Mikuckis <lukasmikuc...@gmail.com> wrote:
>>> >> > Solr version: 4.7
>>> >> >
>>> >> > Architecture:
>>> >> > 2 Solr instances (1 shard, leader + replica)
>>> >> > 3 zookeepers
>>> >> >
>>> >> > Servers:
>>> >> > * zookeeper + Solr (4 GB heap) - 8 GB RAM, 2 CPU cores
>>> >> > * zookeeper + Solr (4 GB heap) - 8 GB RAM, 2 CPU cores
>>> >> > * zookeeper
>>> >> >
>>> >> > Solr data:
>>> >> > * 21 collections
>>> >> > * Many fields, small docs, doc counts per collection from 1k to 500k
>>> >> >
>>> >> > About a week ago Solr started crashing. It crashes every day, 3-4
>>> >> > times a day, usually at night. I can't tell what it could be related
>>> >> > to: we made no configuration changes at that time, and the load
>>> >> > hasn't changed either.
>>> >> >
>>> >> > Everything starts with "Stopping recovery for ..." warnings (each
>>> >> > warning is repeated several times):
>>> >> >
>>> >> > WARN org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for
>>> >> > zkNodeName=core_node1core=******************
>>> >> >
>>> >> > WARN org.apache.solr.cloud.ElectionContext; cancelElection did not
>>> >> > find election node to remove
>>> >> >
>>> >> > WARN org.apache.solr.update.PeerSync; no frame of reference to tell
>>> >> > if we've missed updates
>>> >> >
>>> >> > WARN - 2014-03-23 04:00:26.286; org.apache.solr.update.PeerSync; no
>>> >> > frame of reference to tell if we've missed updates
>>> >> >
>>> >> > WARN - 2014-03-23 04:00:30.728; org.apache.solr.handler.SnapPuller;
>>> >> > File _f9m_Lucene41_0.doc expected to be 6218278 while it is 7759879
>>> >> >
>>> >> > WARN - 2014-03-23 04:00:54.126;
>>> >> > org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay
>>> >> > tlog{file=/path/solr/collection1_shard1_replica2/data/tlog/tlog.0000000000000003272
>>> >> > refcount=2} active=true starting pos=356216606
>>> >> >
>>> >> > Then again "Stopping recovery for ..." warnings:
>>> >> >
>>> >> > WARN org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for
>>> >> > zkNodeName=core_node1core=******************
>>> >> >
>>> >> > ERROR - 2014-03-23 05:19:29.566; org.apache.solr.common.SolrException;
>>> >> > org.apache.solr.common.SolrException: No registered leader was found
>>> >> > after waiting for 4000ms , collection: collection1 slice: shard1
>>> >> >
>>> >> > ERROR - 2014-03-23 05:20:03.961; org.apache.solr.common.SolrException;
>>> >> > org.apache.solr.common.SolrException: I was asked to wait on state down
>>> >> > for IP:PORT_solr but I still do not see the requested state. I see
>>> >> > state: active live:false
>>> >> >
>>> >> > After this, the servers mostly didn't recover.
>>> >>
>>> >> --
>>> >> Regards,
>>> >> Shalin Shekhar Mangar.
>>>
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
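(For reference, since the thread keeps coming back to ZK session timeouts:
in Solr 4.x this is the zkClientTimeout setting. A minimal sketch of where
it is set, assuming the stock example configs; the ZooKeeper host names are
placeholders:

  # as system properties on the Solr JVM:
  java -DzkHost=zk1:2181,zk2:2181,zk3:2181 \
       -DzkClientTimeout=15000 \
       -jar start.jar

  # or in the <solrcloud> section of solr.xml:
  <int name="zkClientTimeout">${zkClientTimeout:15000}</int>

Note that ZooKeeper caps the negotiated session timeout at 20 x tickTime on
the server side, so very large client values may be silently reduced.)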