How many replicas do you have per Solr JVM? Do you
see any OOM errors when you bounce a server?
And how patient are you being? It can
take 3 minutes for a leaderless shard to decide
it needs to elect a leader.
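
One way to answer the first question is to count replicas per node from the
output of the Collections API CLUSTERSTATUS call
(/solr/admin/collections?action=CLUSTERSTATUS&wt=json). The sketch below is
only an illustration. It assumes the JSON response has already been fetched
and parsed into a dict, and that the leader flag appears as "true" on the
leading replica. The function names and the SAMPLE data are hypothetical.

```python
from collections import Counter


def replicas_per_node(cluster_status):
    """Count replicas hosted on each node (one node is one Solr JVM)."""
    counts = Counter()
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll in collections.values():
        for shard in coll.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                counts[replica.get("node_name", "unknown")] += 1
    return counts


def leaderless_shards(cluster_status):
    """Return sorted (collection, shard) pairs with no active leader replica."""
    missing = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            has_leader = any(
                # Assumption: the flag may arrive as the string "true" or a bool.
                str(r.get("leader", "")).lower() == "true"
                and r.get("state") == "active"
                for r in shard.get("replicas", {}).values()
            )
            if not has_leader:
                missing.append((coll_name, shard_name))
    return sorted(missing)


# Hypothetical, hand-written sample shaped like a CLUSTERSTATUS response.
SAMPLE = {
    "cluster": {
        "collections": {
            "iris": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {
                                "node_name": "hostA:30000_solr",
                                "state": "down",
                                "leader": "true",
                            },
                            "core_node2": {
                                "node_name": "hostB:30000_solr",
                                "state": "recovering",
                            },
                        }
                    }
                }
            }
        }
    }
}

if __name__ == "__main__":
    print(replicas_per_node(SAMPLE))
    print(leaderless_shards(SAMPLE))
```

A shard whose only flagged leader is in the "down" state is reported as
leaderless here, which matches the situation where the purported leader has
been stopped but no election has happened yet.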

See SOLR-7280 and SOLR-7191 for the case
where lots of replicas are in the same JVM.
The tell-tale symptom is errors in the log as you
bring Solr up, saying something like
"OutOfMemory error.... unable to create native thread"
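
A quick way to check for that symptom is to scan the Solr log for the
message. This is a rough sketch, not an official tool: the pattern is a
guess at the wording (it also accepts "unable to create new native thread"),
and the function name and the log path in the usage note are hypothetical.

```python
import re

# Assumption: the message contains "unable to create native thread",
# optionally with "new" before "native".
NATIVE_THREAD_OOM = re.compile(
    r"unable to create (new )?native thread", re.IGNORECASE
)


def find_native_thread_oom(log_lines):
    """Return (line_number, text) for every line matching the pattern."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if NATIVE_THREAD_OOM.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

For example, with a log file at a path of your choosing (solr.log here is
just a placeholder), `find_native_thread_oom(open("solr.log"))` returns the
matching lines with their line numbers.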

SOLR-7280 has patches for 6x and 7x, with a 5x one
being added momentarily.

Best,
Erick

On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans <tevans...@googlemail.com> wrote:
> Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
> of the collections on it marked as "Recovering" or "Recovery Failed".
> It attempts to recover from the leader, but the leader responds with:
>
> Error while trying to recover.
> core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error from server at http://172.31.1.171:30000/solr: We are not the leader
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
> at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error from server at http://172.31.1.171:30000/solr: We are not the leader
> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
> at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284)
> at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280)
> ... 5 more
>
> and recovery never occurs.
>
> Each collection in this state has plenty (10+) of active replicas, but
> stopping the server that is marked as the leader doesn't trigger a
> leader election amongst these replicas.
>
> REBALANCELEADERS did nothing.
> FORCELEADER complains that there is already a leader.
> FORCELEADER with the purported leader stopped took 45 seconds,
> reported status of "0" (and no other message) and kept the down node
> as the leader (!)
> Deleting the failed collection from the failed node and re-adding it
> has the same "Leader said I'm not the leader" error message.
>
> Any other ideas?
>
> Cheers
>
> Tom
