We set the ZK timeout to 1s and ran load tests (both indexing and search), but this issue didn't occur.
2014-03-24 17:00 GMT+02:00 Lukas Mikuckis <lukasmikuc...@gmail.com>:

> Garbage Collectors Summary:
> https://apps.sematext.com/spm-reports/s/rgRnwuShgI
>
> Pool Size:
> https://apps.sematext.com/spm-reports/s/H16ndqichM
>
> First Stopping recovery warning: 4:00, OOM error: 6:30.
>
> 2014-03-24 16:35 GMT+02:00 Shalin Shekhar Mangar <shalinman...@gmail.com>:
>
>> I am guessing that it is all related to memory issues. I guess that as
>> the used heap increases, full GC cycles increase causing ZK timeouts
>> which in turn cause more recoveries to be initiated. In the end,
>> everything blows up with the out of memory errors. Do you log GC
>> activity on your servers?
>>
>> I suggest that you rollback to 4.6.1 for now and upgrade to 4.7.1 when
>> it releases next week.
>>
>> On Mon, Mar 24, 2014 at 7:51 PM, Lukas Mikuckis <lukasmikuc...@gmail.com> wrote:
>> > Yes, we upgraded solr from 4.6.1 to 4.7 3 weeks ago (2 weeks before solr
>> > started crashing).
>> > When we were upgrading, we just upgraded solr and changed versions in
>> > collections configs.
>> >
>> > When solr crashes we get OOM but only 2h after first Stopping recovery
>> > warnings.
>> >
>> > Maybe you have any ideas when Stopping recovery warnings are thrown?
>> > Because now we have no idea what could cause this issue.
>> >
>> > Mon, 24 Mar 2014 04:03:17 GMT Shalin Shekhar Mangar <shalinman...@gmail.com>:
>> >>
>> >> Did you upgrade recently to Solr 4.7? 4.7 has a bad bug which can
>> >> cause out of memory issues. Can you check your logs for out of memory
>> >> errors?
>> >>
>> >> On Sun, Mar 23, 2014 at 9:07 PM, Lukas Mikuckis <lukasmikuc...@gmail.com> wrote:
>> >> > Solr version: 4.7
>> >> >
>> >> > Architecture:
>> >> > 2 solrs (1 shard, leader + replica)
>> >> > 3 zookeepers
>> >> >
>> >> > Servers:
>> >> > * zookeeper + solr (heap 4gb) - RAM 8gb, 2 cpu cores
>> >> > * zookeeper + solr (heap 4gb) - RAM 8gb, 2 cpu cores
>> >> > * zookeeper
>> >> >
>> >> > Solr data:
>> >> > * 21 collections
>> >> > * Many fields, small docs, docs count per collection from 1k to 500k
>> >> >
>> >> > About a week ago solr started crashing. It crashes every day, 3-4 times
>> >> > a day, usually at night. I can't tell what it could be related to,
>> >> > because we hadn't made any configuration changes at that time. Load
>> >> > hasn't changed either.
>> >> >
>> >> > Everything starts with Stopping recovery for .. warnings (every warning
>> >> > is repeated several times):
>> >> >
>> >> > WARN org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for zkNodeName=core_node1core=******************
>> >> >
>> >> > WARN org.apache.solr.cloud.ElectionContext; cancelElection did not find election node to remove
>> >> >
>> >> > WARN org.apache.solr.update.PeerSync; no frame of reference to tell if we've missed updates
>> >> >
>> >> > WARN - 2014-03-23 04:00:26.286; org.apache.solr.update.PeerSync; no frame of reference to tell if we've missed updates
>> >> >
>> >> > WARN - 2014-03-23 04:00:30.728; org.apache.solr.handler.SnapPuller; File _f9m_Lucene41_0.doc expected to be 6218278 while it is 7759879
>> >> >
>> >> > WARN - 2014-03-23 04:00:54.126; org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay tlog{file=/path/solr/collection1_shard1_replica2/data/tlog/tlog.0000000000000003272 refcount=2} active=true starting pos=356216606
>> >> >
>> >> > Then again Stopping recovery for .. warnings:
>> >> >
>> >> > WARN org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for zkNodeName=core_node1core=******************
>> >> >
>> >> > ERROR - 2014-03-23 05:19:29.566; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: collection1 slice: shard1
>> >> >
>> >> > ERROR - 2014-03-23 05:20:03.961; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: I was asked to wait on state down for IP:PORT_solr but I still do not see the requested state. I see state: active live:false
>> >> >
>> >> > After this the servers mostly didn't recover.
>> >>
>> >> --
>> >> Regards,
>> >> Shalin Shekhar Mangar.
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.