My guess is that this is all related to memory pressure. As the used heap grows, full GC cycles become more frequent, the resulting pauses cause ZK session timeouts, and those in turn trigger more recoveries. In the end, everything blows up with out-of-memory errors. Do you log GC activity on your servers?
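If GC logging is not on yet, a HotSpot JVM of this era accepts flags such as `-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<file>`. One rough way to test the theory above is to scan that log for full GC pauses long enough to outlast the ZooKeeper session timeout. The Python sketch below is only an illustration: the function name `long_full_gcs` is made up, the 15-second default is a placeholder for whatever `zkClientTimeout` you have configured, and it assumes each full GC line contains the text `Full GC` and a `real=<seconds> secs` field, which depends on the collector and flags in use.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: GC log lines written with PrintGCDetails carry the wall-clock
# pause as "real=<seconds> secs". Adjust the pattern if your log differs.
REAL_PAUSE = re.compile(r"real=(\d+(?:\.\d+)?) secs")


def long_full_gcs(
    lines: Iterable[str], threshold_secs: float = 15.0
) -> List[Tuple[float, str]]:
    """Return (pause_seconds, line) for each full GC at or above the threshold.

    threshold_secs is a placeholder; set it to your ZooKeeper client timeout.
    """
    hits = []
    for line in lines:
        # Only full collections are of interest here; skip young-gen GCs.
        if "Full GC" not in line:
            continue
        match = REAL_PAUSE.search(line)
        if match is None:
            continue
        pause = float(match.group(1))
        if pause >= threshold_secs:
            hits.append((pause, line.rstrip("\n")))
    return hits


# Usage (the path is hypothetical):
#   with open("/path/to/gc.log") as log:
#       for pause, line in long_full_gcs(log, threshold_secs=15.0):
#           print(f"{pause:.2f}s  {line}")
```

If the timestamps of the long pauses it reports fall just before the "Stopping recovery" warnings, that would support the GC explanation; if there are none, the cause is probably elsewhere.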
I suggest that you roll back to 4.6.1 for now and upgrade to 4.7.1 when it releases next week.

On Mon, Mar 24, 2014 at 7:51 PM, Lukas Mikuckis <lukasmikuc...@gmail.com> wrote:
> Yes, we upgraded solr from 4.6.1 to 4.7 3 weeks ago (2 weeks before solr
> started crashing).
> When we were upgrading, we just upgraded solr and changed versions in
> collections configs.
>
> When solr crashes we get OOM but only 2h after first Stopping recovery
> warnings.
>
> Maybe you have any ideas when Stopping recovery warnings are thrown?
> Because now we have no idea what could cause this issue.
>
> Mon, 24 Mar 2014 04:03:17 GMT Shalin Shekhar Mangar <shalinman...@gmail.com>:
>>
>> Did you upgrade recently to Solr 4.7? 4.7 has a bad bug which can
>> cause out of memory issues. Can you check your logs for out of memory
>> errors?
>>
>> On Sun, Mar 23, 2014 at 9:07 PM, Lukas Mikuckis <lukasmikuc...@gmail.com> wrote:
>> > Solr version: 4.7
>> >
>> > Architecture:
>> > 2 solrs (1 shard, leader + replica)
>> > 3 zookeepers
>> >
>> > Servers:
>> > * zookeeper + solr (heap 4gb) - RAM 8gb, 2 cpu cores
>> > * zookeeper + solr (heap 4gb) - RAM 8gb, 2 cpu cores
>> > * zookeeper
>> >
>> > Solr data:
>> > * 21 collections
>> > * Many fields, small docs, docs count per collection from 1k to 500k
>> >
>> > About a week ago solr started crashing. It crashes every day, 3-4 times a
>> > day. Usually at night. I can't tell what it could be related to
>> > because at that time we hadn't made any configuration changes. Load
>> > hasn't changed either.
>> >
>> >
>> > Everything starts with Stopping recovery for .. warnings (every warning is
>> > repeated several times):
>> >
>> > WARN org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for
>> > zkNodeName=core_node1core=******************
>> >
>> > WARN org.apache.solr.cloud.ElectionContext; cancelElection did not find
>> > election node to remove
>> >
>> > WARN org.apache.solr.update.PeerSync; no frame of reference to tell if
>> > we've missed updates
>> >
>> > WARN - 2014-03-23 04:00:26.286; org.apache.solr.update.PeerSync; no frame
>> > of reference to tell if we've missed updates
>> >
>> > WARN - 2014-03-23 04:00:30.728; org.apache.solr.handler.SnapPuller; File
>> > _f9m_Lucene41_0.doc expected to be 6218278 while it is 7759879
>> >
>> > WARN - 2014-03-23 04:00:54.126;
>> > org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay
>> > tlog{file=/path/solr/collection1_shard1_replica2/data/tlog/tlog.0000000000000003272
>> > refcount=2} active=true starting pos=356216606
>> >
>> > Then again Stopping recovery for .. warnings:
>> >
>> > WARN org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for
>> > zkNodeName=core_node1core=******************
>> >
>> > ERROR - 2014-03-23 05:19:29.566; org.apache.solr.common.SolrException;
>> > org.apache.solr.common.SolrException: No registered leader was found after
>> > waiting for 4000ms , collection: collection1 slice: shard1
>> >
>> > ERROR - 2014-03-23 05:20:03.961; org.apache.solr.common.SolrException;
>> > org.apache.solr.common.SolrException: I was asked to wait on state down for
>> > IP:PORT_solr but I still do not see the requested state. I see state:
>> > active live:false
>> >
>> >
>> > After this the servers mostly didn't recover.
>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.

--
Regards,
Shalin Shekhar Mangar.