Hi All - I had one node in a 45-shard cluster (9 physical machines) run out of memory. I stopped all the nodes in the cluster and removed any write.lock files left behind in HDFS by the OOM (rough cleanup sketch below). All the nodes recovered except one replica of one shard, which happens to be on the node that ran out of memory. The error is:

Error while trying to recover: org.apache.solr.common.SolrException: Replication for recovery failed.
    at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:159)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:408)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Is there anything I can check? The index is stored in HDFS, and the replica just keeps retrying recovery in a loop.
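For reference, this is roughly how I cleared the stale locks, using the Hadoop FileSystem API. It's only a sketch: the NameNode address and index root path are placeholders, not our real values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ClearStaleLocks {
    public static void main(String[] args) throws Exception {
        // Placeholder for the real index root in HDFS.
        Path indexRoot = new Path("hdfs://namenode:8020/solr");
        try (FileSystem fs = FileSystem.get(indexRoot.toUri(), new Configuration())) {
            // Walk all index directories recursively.
            RemoteIterator<LocatedFileStatus> files = fs.listFiles(indexRoot, true);
            while (files.hasNext()) {
                Path p = files.next().getPath();
                // Solr's HdfsLockFactory can leave write.lock behind after a hard crash/OOM.
                if ("write.lock".equals(p.getName())) {
                    System.out.println("Deleting stale lock: " + p);
                    fs.delete(p, false);
                }
            }
        }
    }
}

(This was run only with all Solr nodes stopped, so no live core was holding a lock.)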

Thank you!

-Joe
