Solr version: 4.7

Architecture:
2 solrs (1 shard, leader + replica)
3 zookeepers

Servers:
* zookeeper + solr (heap 4gb) - RAM 8gb, 2 cpu cores
* zookeeper + solr (heap 4gb) - RAM 8gb, 2 cpu cores
* zookeeper

Solr data:
* 21 collections
* Many fields, small docs, docs count per collection from 1k to 500k

About a week ago Solr started crashing. It crashes every day, 3-4 times a
day, usually at night. I can't tell what it could be related to, because
we haven't made any configuration changes around that time and the load
hasn't changed either.


Everything starts with "Stopping recovery for ..." warnings (each warning
is repeated several times):

WARN  org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for
zkNodeName=core_node1core=******************

WARN  org.apache.solr.cloud.ElectionContext; cancelElection did not find
election node to remove

WARN  org.apache.solr.update.PeerSync; no frame of reference to tell if
we've missed updates

WARN  - 2014-03-23 04:00:26.286; org.apache.solr.update.PeerSync; no frame
of reference to tell if we've missed updates

WARN  - 2014-03-23 04:00:30.728; org.apache.solr.handler.SnapPuller; File
_f9m_Lucene41_0.doc expected to be 6218278 while it is 7759879

WARN  - 2014-03-23 04:00:54.126;
org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay
tlog{file=/path/solr/collection1_shard1_replica2/data/tlog/tlog.0000000000000003272
refcount=2} active=true starting pos=356216606

Then the "Stopping recovery for ..." warnings appear again:

WARN  org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for
zkNodeName=core_node1core=******************

ERROR - 2014-03-23 05:19:29.566; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException: No registered leader was found after
waiting for 4000ms , collection: collection1 slice: shard1

ERROR - 2014-03-23 05:20:03.961; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException: I was asked to wait on state down for
IP:PORT_solr but I still do not see the requested state. I see state:
active live:false


After this, the servers mostly didn't recover.
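
To check whether the crashes line up with a particular hour of the night,
the warnings and errors can be tallied per hour and per logger class. Below
is a minimal sketch in Python, assuming each log entry sits on a single line
in the layout of the timestamped excerpts above (level, " - ", timestamp,
logger class, message, separated by ";"). The name tally and the regular
expression are illustrative only and are not part of Solr. Lines without a
timestamp are skipped.

```python
import re
from collections import Counter

# Assumed layout (taken from the timestamped excerpts in this post):
#   WARN  - 2014-03-23 04:00:26.286; org.apache.solr.update.PeerSync; message
LINE_RE = re.compile(
    r"^(?P<level>WARN|ERROR)\s+-\s+"
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2}\.\d{3};\s*"
    r"(?P<logger>[\w.$]+);"
)


def tally(lines):
    """Count WARN/ERROR entries per (date, hour, level, short logger name)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not a timestamped WARN/ERROR line
        short_logger = m["logger"].rsplit(".", 1)[-1]
        counts[(m["date"], m["hour"], m["level"], short_logger)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally_log.py < solr.log
    for key, n in sorted(tally(sys.stdin).items()):
        print(*key, n)
```

Sorting the output by date and hour makes it easy to see whether the
recovery warnings cluster around the same time every night.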
