Hi all, I am experiencing a problem where Solr nodes go into recovery following an update cycle.
Examination of the logs indicates that the recovery is initiated by the shard leader while processing regular update events, because the replica is unreachable. For example, the following is recorded in the leader’s log file:

...
2014-12-15 05:14:03.285 [qtp2092193830-400307] INFO org.apache.solr.cloud.ZkController Put replica core=listings coreNodeName=solr12:8983_solr_listings on solr12:8983_solr into leader-initiated recovery.
2014-12-15 05:14:03.285 [qtp2092193830-400307] WARN org.apache.solr.cloud.ZkController Leader is publishing core=listings coreNodeName =solr12:8983_solr_listings state=down on behalf of un-reachable replica http://solr12:8983/solr/listings/; forcePublishState? false
2014-12-15 05:14:03.287 [zkCallback-2-thread-20] INFO org.apache.solr.cloud.DistributedQueue LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
...

However, when I check, I cannot detect any connectivity problems between the leader and the replica.

About 40% of the time, the nodes recover without any intervention in 4 or 5 minutes. The remaining 60% of the time, however, the recovering node reports a java.lang.OutOfMemoryError and Solr needs to be restarted.

For background, here are some details about our configuration:

* Solr 4.10.2 (problem also observed with Solr 4.6.1)
* 12 shards with 2 nodes per shard
* a single updater running in a separate subnet is posting updates using the SolrJ CloudSolrServer client. Updates are triggered hourly.
* system is under continuous query load
* autoCommit is set to 821 seconds
* autoSoftCommit is set to 303 seconds

I cannot correlate these recovery events with an increase in update or query load. Query traffic does not appear to be affected by any transient connectivity issues. The only clear pattern is that these recovery events happen after an updater run, while the cluster is busy processing the updates.

Can anyone suggest where to look to figure out why these recovery events are occurring?

Thanks, Lindsay Martin