Last night the problem occurred again, and I have more data. This time it
happened on only one Solr server, and that server successfully recovered.
The Solr that had all the leaders:

[06:38:58.205 - 06:38:58.222] Stopping recovery for
zkNodeName=core_node2core=****** (for all collections)

[06:38:59.322 - 06:38:59.995] ElectionContext: cancelElection did not find
election node to remove (many times)

[06:39:02.403] PeerSync: no frame of reference to tell if we've missed
updates (after this, Solr recovered)

One of the zookeepers (the one on the same server as the Solr that got the
warnings):

[06:38:58.099, 06:38:58.113, 06:38:58.114]
org.apache.zookeeper.server.NIOServerCnxn: [ERROR] Unexpected Exception:
java.nio.channels.CancelledKeyException (3 times)

The other Solr and the other zookeepers didn't log any errors or warnings.

Some monitoring data:

Garbage Collectors Summary: https://apps.sematext.com/spm-reports/s/RYZxbcHXzu
Pool Size: https://apps.sematext.com/spm-reports/s/N5c8QFc86d
Pool Utilization: https://apps.sematext.com/spm-reports/s/B487KaWGXP
Load: https://apps.sematext.com/spm-reports/s/ytfFzqYBl2

2014-03-24 17:39 GMT+02:00 Lukas Mikuckis <lukasmikuc...@gmail.com>:

> We tried setting the ZK timeout to 1s and did load testing (both indexing
> and search), and this issue didn't happen.
>
> 2014-03-24 17:00 GMT+02:00 Lukas Mikuckis <lukasmikuc...@gmail.com>:
>
>> Garbage Collectors Summary:
>> https://apps.sematext.com/spm-reports/s/rgRnwuShgI
>>
>> Pool Size:
>> https://apps.sematext.com/spm-reports/s/H16ndqichM
>>
>> First Stopping recovery warning: 4:00, OOM error: 6:30.
>>
>> 2014-03-24 16:35 GMT+02:00 Shalin Shekhar Mangar <shalinman...@gmail.com>:
>>
>>> I am guessing that it is all related to memory issues. I guess that as
>>> the used heap increases, full GC cycles increase, causing ZK timeouts,
>>> which in turn cause more recoveries to be initiated. In the end,
>>> everything blows up with the out-of-memory errors. Do you log GC
>>> activity on your servers?
>>>
>>> I suggest that you roll back to 4.6.1 for now and upgrade to 4.7.1 when
>>> it is released next week.
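(A side note on the GC-logging question: in case it isn't enabled yet, a
minimal sketch of the standard HotSpot flags is below. This assumes an
Oracle/Sun JDK 6 or 7 and the stock Jetty start.jar that ships with Solr
4.x; the heap size and log path are just examples.

  # log basic GC events with per-generation detail, wall-clock
  # timestamps, and stop-the-world pause lengths to a dedicated file:
  java -Xmx4g \
       -verbose:gc \
       -XX:+PrintGCDetails \
       -XX:+PrintGCDateStamps \
       -XX:+PrintGCApplicationStoppedTime \
       -Xloggc:/var/log/solr/gc.log \
       -jar start.jar

Stop-the-world pauses in that log longer than the ZK session timeout would
directly support the GC-causes-ZK-timeouts theory above.)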
>>> On Mon, Mar 24, 2014 at 7:51 PM, Lukas Mikuckis <lukasmikuc...@gmail.com> wrote:
>>> > Yes, we upgraded Solr from 4.6.1 to 4.7 three weeks ago (two weeks
>>> > before Solr started crashing). When we upgraded, we just upgraded Solr
>>> > itself and changed the versions in the collection configs.
>>> >
>>> > When Solr crashes we get an OOM, but only 2h after the first Stopping
>>> > recovery warnings.
>>> >
>>> > Do you have any idea when Stopping recovery warnings are thrown? Right
>>> > now we have no idea what could be causing this issue.
>>> >
>>> > Mon, 24 Mar 2014 04:03:17 GMT Shalin Shekhar Mangar <shalinman...@gmail.com>:
>>> >>
>>> >> Did you upgrade recently to Solr 4.7? 4.7 has a bad bug which can
>>> >> cause out of memory issues. Can you check your logs for out of memory
>>> >> errors?
>>> >>
>>> >> On Sun, Mar 23, 2014 at 9:07 PM, Lukas Mikuckis <lukasmikuc...@gmail.com> wrote:
>>> >> > Solr version: 4.7
>>> >> >
>>> >> > Architecture:
>>> >> > 2 Solr instances (1 shard, leader + replica)
>>> >> > 3 zookeepers
>>> >> >
>>> >> > Servers:
>>> >> > * zookeeper + Solr (4 GB heap) - 8 GB RAM, 2 CPU cores
>>> >> > * zookeeper + Solr (4 GB heap) - 8 GB RAM, 2 CPU cores
>>> >> > * zookeeper
>>> >> >
>>> >> > Solr data:
>>> >> > * 21 collections
>>> >> > * Many fields, small docs, doc counts per collection from 1k to 500k
>>> >> >
>>> >> > About a week ago Solr started crashing. It crashes every day, 3-4
>>> >> > times a day, usually at night. I can't tell what it could be related
>>> >> > to: we made no configuration changes at that time, and the load
>>> >> > hasn't changed either.
>>> >> >
>>> >> > Everything starts with "Stopping recovery for ..." warnings (each
>>> >> > warning is repeated several times):
>>> >> >
>>> >> > WARN org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for
>>> >> > zkNodeName=core_node1core=******************
>>> >> >
>>> >> > WARN org.apache.solr.cloud.ElectionContext; cancelElection did not
>>> >> > find election node to remove
>>> >> >
>>> >> > WARN org.apache.solr.update.PeerSync; no frame of reference to tell
>>> >> > if we've missed updates
>>> >> >
>>> >> > WARN - 2014-03-23 04:00:26.286; org.apache.solr.update.PeerSync; no
>>> >> > frame of reference to tell if we've missed updates
>>> >> >
>>> >> > WARN - 2014-03-23 04:00:30.728; org.apache.solr.handler.SnapPuller;
>>> >> > File _f9m_Lucene41_0.doc expected to be 6218278 while it is 7759879
>>> >> >
>>> >> > WARN - 2014-03-23 04:00:54.126;
>>> >> > org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay
>>> >> > tlog{file=/path/solr/collection1_shard1_replica2/data/tlog/tlog.0000000000000003272
>>> >> > refcount=2} active=true starting pos=356216606
>>> >> >
>>> >> > Then again "Stopping recovery for ..." warnings:
>>> >> >
>>> >> > WARN org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for
>>> >> > zkNodeName=core_node1core=******************
>>> >> >
>>> >> > ERROR - 2014-03-23 05:19:29.566; org.apache.solr.common.SolrException;
>>> >> > org.apache.solr.common.SolrException: No registered leader was found
>>> >> > after waiting for 4000ms , collection: collection1 slice: shard1
>>> >> >
>>> >> > ERROR - 2014-03-23 05:20:03.961; org.apache.solr.common.SolrException;
>>> >> > org.apache.solr.common.SolrException: I was asked to wait on state down
>>> >> > for IP:PORT_solr but I still do not see the requested state. I see
>>> >> > state: active live:false
>>> >> >
>>> >> > After this, the servers mostly didn't recover.
>>> >>
>>> >> --
>>> >> Regards,
>>> >> Shalin Shekhar Mangar.
>>>
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
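(For reference, since the thread keeps coming back to ZK session timeouts:
in Solr 4.x this is the zkClientTimeout setting. A minimal sketch of where
it is set, assuming the stock example configs; the ZooKeeper host names are
placeholders:

  # as system properties on the Solr JVM:
  java -DzkHost=zk1:2181,zk2:2181,zk3:2181 \
       -DzkClientTimeout=15000 \
       -jar start.jar

  # or in the <solrcloud> section of solr.xml:
  <int name="zkClientTimeout">${zkClientTimeout:15000}</int>

Note that ZooKeeper caps the negotiated session timeout at 20 x tickTime on
the server side, so very large client values may be silently reduced.)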