Hi All - I had one node in a 45-shard cluster (9 physical machines) run
out of memory. I stopped all the nodes in the cluster and removed from
HDFS any write.lock files left behind by the OOM. All the nodes
recovered except one replica of one shard, which happens to live on the
node that ran out of memory. The error is:
Error while trying to recover: org.apache.solr.common.SolrException: Replication for recovery failed.
        at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:159)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:408)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Is there anything I can check? The index is stored in HDFS, and the
replica seems to be stuck in a loop, retrying the recovery over and over.
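
In case the cleanup itself is suspect: the lock removal was conceptually
along the lines of the sketch below, written against the Hadoop
FileSystem API. The /solr root and the class name are placeholders; the
real root is whatever solr.hdfs.home points at on our cluster.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Sweep an HDFS tree for leftover Lucene write.lock files and remove them.
// Only run this while every Solr node is stopped.
public class LockSweep {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration(); // picks up fs.defaultFS from core-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            // "/solr" is a placeholder for the actual solr.hdfs.home root.
            RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/solr"), true);
            while (it.hasNext()) {
                Path p = it.next().getPath();
                if (p.getName().equals("write.lock")) {
                    System.out.println("Removing leftover lock: " + p);
                    fs.delete(p, false); // single file, not recursive
                }
            }
        }
    }
}

Nothing other than write.lock files was deleted, and only after all the
nodes were stopped.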
Thank you!
-Joe