Unexplained leader initiated recovery after updates

Lindsay Martin Fri, 09 Jan 2015 15:55:10 -0800

Hi all,

I am experiencing a problem where Solr nodes go into recovery following an 
update cycle.


Examination of the logs indicates that the recovery is initiated by the shard 
master while processing regular update events, because the replica is 
unreachable.

For example, the following is recorded in the leader’s log file:

...
2014-12-15 05:14:03.285 [qtp2092193830-400307] INFO  
org.apache.solr.cloud.ZkController  Put replica core=listings 
coreNodeName=solr12:8983_solr_listings on solr12:8983_solr into 
leader-initiated recovery.
2014-12-15 05:14:03.285 [qtp2092193830-400307] WARN  
org.apache.solr.cloud.ZkController  Leader is publishing core=listings 
coreNodeName =solr12:8983_solr_listings state=down on behalf of un-reachable 
replica http://solr12:8983/solr/listings/; forcePublishState? false
2014-12-15 05:14:03.287 [zkCallback-2-thread-20] INFO  
org.apache.solr.cloud.DistributedQueue LatchChildWatcher fired on path: 
/overseer/queue state: SyncConnected type NodeChildrenChanged
...

However when I check, I cannot detect any connectivity problems between the 
leader and the replica. About 40% of the time, the nodes recover without any 
intervention in 4 or 5 minutes. The remaining 60% of the time however, the 
recovering node reports a java.lang.OutOfMemoryError and Solr needs to be 
restarted.

For background, here are some details about our configuration:
* Solr 4.10.2 (problem also observed with Solr 4.6.1)
* 12 shards with 2 nodes per shard
* a single updater running in a separate subnet is posting updates using the 
SolrJ CloudSolrServer client. Updates are triggered hourly.
* system is under continuous query load
* autoCommit is set to 821 seconds
* autoSoftCommit is set to 303 seconds

I cannot correlate these recovery events to an increase in update or query 
load. The query traffic is does not appear to be affected by any transient 
connectivity issues.

The only clear pattern is that these recovery events happen after an updater 
run and the cluster is busy processing the updates.

Can suggest where to look to figure out why these recovery events are occurring?

Thanks,

Lindsay Martin

Unexplained leader initiated recovery after updates

Reply via email to