We have a SolrCloud setup (within a Cloudera deployment) using Solr 4.10.3. The Solr cluster consists of 2 nodes. Both nodes use HDFS as their backing store (not the local file system). Once every 1-2 weeks, the system goes into recovery for no apparent reason. Digging through the logs, it looks something like this:
From time to time, when the load is high, the leader/replica election goes wrong. We see this:

    May 31, 10:31:46.524 AM ERROR org.apache.solr.core.SolrCore org.apache.solr.common.SolrException: ClusterState says we are the leader ( http://grbbd1nodp06.core.local:8983/solr/lily_entity_CUSTOMER_shard1_replica2), but locally we don't think so. Request came from null

When that happens, the cluster starts recovering:

    May 31, 10:33:04.621 AM INFO org.apache.solr.cloud.RecoveryStrategy Publishing state of core lily_entity_CUSTOMER_shard2_replica2 as recovering, leader is http://grbbd1nodp05.core.local:8983/solr/lily_entity_CUSTOMER_shard2_replica1/ and I am http://grbbd1nodp06.core.local:8983/solr/lily_entity_CUSTOMER_shard2_replica2/

Apparently, that doesn't go too smoothly either. First it tries something called "PeerSync", which fails, and then Solr falls back to "replication". I suspect that "PeerSync" is a recovery strategy where the last N updates are transferred from one node to the other so that the other node can catch up, and that "replication" probably means copying the entire contents of one node to another.

    PeerSync Recovery was not successful - trying replication. core=lily_models_shard2_replica1
    grbbd1nodp06.core.local INFO May 31, 2016 10:32 AM RecoveryStrategy Starting Replication Recovery. core=lily_models_shard2_replica1

Apparently the recovery of a Solr node happens piecemeal. For some pieces, the replication or PeerSync seems to fail; for others it does not.

    Wait 2.0 seconds before trying to recover again (1)
    grbbd1nodp06.core.local ERROR May 31, 2016 10:32 AM RecoveryStrategy Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
        at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:168)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:448)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:237)
    grbbd1nodp06.core.local ERROR May 31, 2016 10:32 AM RecoveryStrategy Recovery failed - trying again... (0) core=lily_entity_CUSTOMER_shard2_replica2
    grbbd1nodp06.core.local ERROR May 31, 2016 10:32 AM ReplicationHandler SnapPull failed :org.apache.solr.common.SolrException: Index fetch failed :
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:573)
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:310)
        at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:349)
        at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:165)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:448)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:237)
    Caused by: java.io.FileNotFoundException: File does not exist: hdfs://grbbd1clup01-ns/solr/lily_entity_CUSTOMER/core_node2/data/index.20160429093321915/segments_8hp9
        at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1218)
        at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1210)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1210)
        at org.apache.solr.store.hdfs.HdfsDirectory$HdfsIndexInput.<init>(HdfsDirectory.java:205)
        at org.apache.solr.store.hdfs.HdfsDirectory.openInput(HdfsDirectory.java:136)
        at org.apache.solr.store.blockcache.BlockDirectory.openInput(BlockDirectory.java:124)
        at org.apache.solr.store.blockcache.BlockDirectory.openInput(BlockDirectory.java:144)
        at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:198)
        at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:113)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:341)
        at org.apache.solr.handler.SnapPuller.hasUnusedFiles(SnapPuller.java:623)
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:456)
        ... 5 more

After 10 minutes or so, the system does seem to recover. I dug through the Solr bugs but could not find this exact issue. I'm particularly intrigued by why the leader/replica election seems to go wrong. Any suggestions?
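
To make my guess about the two recovery paths concrete, here is a toy Python model of how I read the log. It is not Solr's code and uses no Solr API: the Node class, the recover function and the window parameter are all names I made up, and the rule "PeerSync only works if everything the replica is missing is among the leader's last N updates" is my assumption, not something I have confirmed in the source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a Solr core: just an ordered list of update versions."""
    name: str
    updates: List[int] = field(default_factory=list)


def recover(replica: Node, leader: Node, window: int = 100) -> str:
    """Bring `replica` in line with `leader` and report which path was taken.

    Assumed rule (my guess, not verified against Solr): the cheap path is only
    possible when every update the replica lacks is still among the leader's
    last `window` updates. Otherwise the whole index is copied.
    """
    have = set(replica.updates)
    missing = [v for v in leader.updates if v not in have]

    # The leader's most recent `window` updates (empty when window is 0).
    start = max(0, len(leader.updates) - window)
    recent = set(leader.updates[start:])

    if all(v in recent for v in missing):
        # "PeerSync": replay only the missing recent updates.
        replica.updates.extend(missing)
        return "peersync"

    # "Replication": discard the local state and copy everything from the leader.
    replica.updates = list(leader.updates)
    return "replication"


if __name__ == "__main__":
    leader = Node("leader", list(range(1, 11)))
    slightly_behind = Node("replica_a", list(range(1, 9)))
    far_behind = Node("replica_b", [1, 2, 3])
    print(recover(slightly_behind, leader, window=5))
    print(recover(far_behind, leader, window=5))
```

If that model is roughly right, the "PeerSync Recovery was not successful" message would simply mean the replica had fallen further behind than the window allows, which would fit with this happening under high load.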