How many replicas per Solr JVM? And do you see any OOM errors when you bounce a server? And how patient are you being? It can take 3 minutes for a leaderless shard to decide it needs to elect a leader.
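If it helps to answer the first and third questions, here is a rough sketch (not anything shipped with Solr) that works on the parsed JSON from the Collections API CLUSTERSTATUS action, e.g. /solr/admin/collections?action=CLUSTERSTATUS&wt=json. It counts replicas per node and lists shards where no active replica is flagged as leader. The function names, node names and the SAMPLE data are made up for illustration, and I'm assuming the usual cluster -> collections -> shards -> replicas layout with "node_name", "state" and "leader" keys on each replica.

```python
from collections import Counter


def _collections(cluster_status):
    # Assumed layout of a CLUSTERSTATUS response: cluster -> collections.
    return cluster_status.get("cluster", {}).get("collections", {})


def replicas_per_node(cluster_status):
    """Count replicas hosted on each node (one node is one Solr JVM)."""
    counts = Counter()
    for coll in _collections(cluster_status).values():
        for shard in coll.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                counts[replica.get("node_name", "unknown")] += 1
    return counts


def leaderless_shards(cluster_status):
    """Return sorted (collection, shard) pairs with no *active* leader."""
    missing = []
    for cname, coll in sorted(_collections(cluster_status).items()):
        for sname, shard in sorted(coll.get("shards", {}).items()):
            has_leader = any(
                # The leader flag is normally the string "true"; accept a bool too.
                str(r.get("leader", "")).lower() == "true"
                and r.get("state") == "active"
                for r in shard.get("replicas", {}).values()
            )
            if not has_leader:
                missing.append((cname, sname))
    return missing


# Hypothetical data: shard2's "leader" lives on a node that is down.
SAMPLE = {"cluster": {"collections": {"iris": {"shards": {
    "shard1": {"replicas": {
        "core_node1": {"node_name": "nodeA:8983_solr", "state": "active",
                       "leader": "true"},
        "core_node2": {"node_name": "nodeB:8983_solr", "state": "recovering"},
    }},
    "shard2": {"replicas": {
        "core_node3": {"node_name": "nodeA:8983_solr", "state": "down",
                       "leader": "true"},
        "core_node4": {"node_name": "nodeB:8983_solr", "state": "active"},
    }},
}}}}}

if __name__ == "__main__":
    print(dict(replicas_per_node(SAMPLE)))
    print(leaderless_shards(SAMPLE))
```

Treating a leader whose state is not "active" as "no leader" is a choice made for this sketch; it is meant to flag the case where a stopped node is still recorded as the leader.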
See SOLR-7280 and SOLR-7191 for the case where lots of replicas are in the same JVM. The tell-tale symptom is errors in the log as you bring Solr up, saying something like "OutOfMemory error.... unable to create native thread".

SOLR-7280 has patches for 6x and 7x, with a 5x one being added momentarily.

Best,
Erick

On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans <tevans...@googlemail.com> wrote:
> Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
> of the collections on it marked as "Recovering" or "Recovery Failed".
> It attempts to recover from the leader, but the leader responds with:
>
> Error while trying to recover.
> core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error from server at http://172.31.1.171:30000/solr: We are not the
> leader
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>         at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by:
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error from server at http://172.31.1.171:30000/solr: We are not the
> leader
>         at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
>         at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284)
>         at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280)
>         ... 5 more
>
> and recovery never occurs.
>
> Each collection in this state has plenty (10+) of active replicas, but
> stopping the server that is marked as the leader doesn't trigger a
> leader election amongst these replicas.
>
> REBALANCELEADERS did nothing.
> FORCELEADER complains that there is already a leader.
> FORCELEADER with the purported leader stopped took 45 seconds,
> reported status of "0" (and no other message) and kept the down node
> as the leader (!)
> Deleting the failed collection from the failed node and re-adding it
> has the same "Leader said I'm not the leader" error message.
>
> Any other ideas?
>
> Cheers
>
> Tom