We have a SolrCloud setup (within a Cloudera deployment) running Solr
4.10.3.  The cluster consists of 2 nodes, both with their backing store
on HDFS (not on the local file system).  Once every 1-2 weeks, the system
goes into recovery for no apparent reason.  Digging through the logs, it
looks something like this:

From time to time, when the load is high, the leader/replica election goes
wrong. We see that

May 31, 10:31:46.524 AM ERROR org.apache.solr.core.SolrCore
org.apache.solr.common.SolrException: ClusterState says we are the leader (http://grbbd1nodp06.core.local:8983/solr/lily_entity_CUSTOMER_shard1_replica2), but locally we don't think so. Request came from null
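For reference, the "ClusterState" in that message is what is stored in
ZooKeeper under /clusterstate.json, so it can be dumped and compared with
what the node believes locally. A minimal sketch of doing that with the
plain ZooKeeper client; the ZK connect string below is an assumption
(substitute your own ensemble):

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Dump /clusterstate.json straight from ZooKeeper: this is the
// "ClusterState" the error message refers to, including which replica
// ZooKeeper currently considers the leader of each shard.
public class DumpZkClusterState {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZK host; replace with the real ensemble address.
        ZooKeeper zk = new ZooKeeper("grbbd1clup01:2181", 30000, new Watcher() {
            public void process(WatchedEvent event) { /* no-op */ }
        });
        byte[] data = zk.getData("/clusterstate.json", false, null);
        System.out.println(new String(data, "UTF-8")); // look for "leader":"true"
        zk.close();
    }
}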

When that happens, the cluster starts recovering:

May 31, 10:33:04.621 AM INFO org.apache.solr.cloud.RecoveryStrategy
Publishing state of core lily_entity_CUSTOMER_shard2_replica2 as recovering, leader is http://grbbd1nodp05.core.local:8983/solr/lily_entity_CUSTOMER_shard2_replica1/ and I am http://grbbd1nodp06.core.local:8983/solr/lily_entity_CUSTOMER_shard2_replica2/

Apparently, that doesn't go too smoothly either. First it tries something
called "PeerSync", which fails, and then Solr falls back to "replication".
As far as I understand, "PeerSync" is a recovery strategy where the last N
updates are fetched from the leader and replayed, so that the recovering
node can catch up, and "replication" is copying the entire index from the
leader to the recovering node.
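To make that mental model concrete, here is a hypothetical sketch of the
two-step strategy. This is not Solr's actual code; the types (Update,
UpdateLog, IndexCopier) are invented purely for illustration:

import java.util.List;

// Hypothetical illustration of the two recovery strategies as I
// understand them; all supporting types are made up for this sketch.
public class RecoverySketch {

    /** Try a PeerSync-style catch-up first; fall back to full replication. */
    static void recover(UpdateLog leaderLog, UpdateLog localLog, IndexCopier copier) {
        long localHighest = localLog.highestVersion();
        List<Update> missing = leaderLog.updatesSince(localHighest);

        if (missing != null) {
            // PeerSync: the gap is small enough that the leader still has
            // every update we lack, so replay just those.
            for (Update u : missing) {
                localLog.apply(u);
            }
        } else {
            // Replication: we fell too far behind; discard the local copy
            // and pull the whole index from the leader.
            copier.copyEntireIndexFromLeader();
        }
    }

    // --- invented supporting types, just enough to make the sketch compile ---
    interface Update { }
    interface UpdateLog {
        long highestVersion();
        List<Update> updatesSince(long version); // null if the leader no longer has them
        void apply(Update u);
    }
    interface IndexCopier {
        void copyEntireIndexFromLeader();
    }
}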

PeerSync Recovery was not successful - trying replication. core=lily_models_shard2_replica1
grbbd1nodp06.core.local INFO May 31, 2016 10:32 AM RecoveryStrategy
Starting Replication Recovery. core=lily_models_shard2_replica1

Apparently, recovery does not happen for a node as a whole but per core
(each shard replica recovers on its own): for some cores the PeerSync or
replication fails, for others it succeeds.
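While an episode is underway, the per-replica state ("active",
"recovering", "down") can be watched by polling the Collections API's
CLUSTERSTATUS action, available since Solr 4.8. A minimal sketch, assuming
either node answers on port 8983:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Poll CLUSTERSTATUS and print the raw JSON; each replica entry carries a
// "state" field, which shows exactly which cores are still catching up.
public class WatchRecovery {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://grbbd1nodp06.core.local:8983/solr/admin/collections"
                + "?action=CLUSTERSTATUS&wt=json");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);
        }
        in.close();
    }
}

The failing side of one of these episodes looks like this: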

Wait 2.0 seconds before trying to recover again (1)
grbbd1nodp06.core.local ERROR May 31, 2016 10:32 AM RecoveryStrategy
Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:168)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:448)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:237)
grbbd1nodp06.core.local ERROR May 31, 2016 10:32 AM RecoveryStrategy
Recovery failed - trying again... (0) core=lily_entity_CUSTOMER_shard2_replica2
grbbd1nodp06.core.local ERROR May 31, 2016 10:32 AM ReplicationHandler
SnapPull failed :org.apache.solr.common.SolrException: Index fetch failed :
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:573)
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:310)
at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:349)
at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:165)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:448)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:237)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://grbbd1clup01-ns/solr/lily_entity_CUSTOMER/core_node2/data/index.20160429093321915/segments_8hp9
at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1218)
at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1210)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1210)
at org.apache.solr.store.hdfs.HdfsDirectory$HdfsIndexInput.<init>(HdfsDirectory.java:205)
at org.apache.solr.store.hdfs.HdfsDirectory.openInput(HdfsDirectory.java:136)
at org.apache.solr.store.blockcache.BlockDirectory.openInput(BlockDirectory.java:124)
at org.apache.solr.store.blockcache.BlockDirectory.openInput(BlockDirectory.java:144)
at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:198)
at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:113)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:341)
at org.apache.solr.handler.SnapPuller.hasUnusedFiles(SnapPuller.java:623)
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:456)
... 5 more
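The Caused-by suggests the fetch fails because a segments file under one of
those timestamped index directories has meanwhile disappeared. To rule out
plain HDFS trouble, the file can be checked directly with the Hadoop
FileSystem API; a minimal sketch using the exact path from the trace
(assumes core-site.xml/hdfs-site.xml are on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Check whether the segments file the SnapPuller complains about
// actually exists on HDFS. The path is copied from the stack trace above.
public class CheckSegmentsFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up the cluster config
        Path p = new Path("hdfs://grbbd1clup01-ns/solr/lily_entity_CUSTOMER/"
                + "core_node2/data/index.20160429093321915/segments_8hp9");
        FileSystem fs = p.getFileSystem(conf);
        System.out.println(p + " exists: " + fs.exists(p));
    }
}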

After 10 minutes or so, the system does seem to recover.

I dug through the Solr bug tracker but could not find this exact issue.
I'm particularly intrigued by why the leader/replica election seems to go
wrong in the first place.  Any suggestions?
