Re: Node not recovering, leader elections not occurring

2016-07-19 Thread Jeff Wartes
It sounds like the node-local version of the ZK cluster state has diverged from the ZK cluster state. You should check the contents of ZooKeeper and verify the state there looks sane. I’ve had issues (v5.4) on a few occasions where leader election got screwed up to the point where I had to delete
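
For anyone wanting to script that sanity check, here is a rough sketch in Python. It assumes you have already fetched the JSON returned by the Collections API CLUSTERSTATUS action (which reflects what is stored in ZooKeeper) and parsed it into a dict. The function name, the node names and the sample data are made up for illustration; only the collection names come from this thread. It flags any shard that has no replica marked as leader that is also active and hosted on a live node.

```python
def find_leaderless_shards(cluster_status):
    """Return sorted (collection, shard) pairs lacking a live, active leader.

    `cluster_status` is assumed to be the parsed JSON of a CLUSTERSTATUS
    response, i.e. a dict with a top-level "cluster" key.
    """
    cluster = cluster_status.get("cluster", {})
    live_nodes = set(cluster.get("live_nodes", []))
    problems = []
    for coll_name, coll in sorted(cluster.get("collections", {}).items()):
        for shard_name, shard in sorted(coll.get("shards", {}).items()):
            has_leader = False
            for replica in shard.get("replicas", {}).values():
                # Solr reports the leader flag as the string "true".
                is_leader = str(replica.get("leader", "false")).lower() == "true"
                is_active = replica.get("state") == "active"
                is_live = replica.get("node_name") in live_nodes
                if is_leader and is_active and is_live:
                    has_leader = True
                    break
            if not has_leader:
                problems.append((coll_name, shard_name))
    return problems


# Hypothetical sample: the "lookups" leader sits on a node that is not live.
SAMPLE = {
    "cluster": {
        "live_nodes": ["node1:8983_solr"],
        "collections": {
            "lookups": {"shards": {"shard1": {"replicas": {
                "core_node1": {"state": "recovering",
                               "node_name": "node1:8983_solr"},
                "core_node2": {"state": "active", "leader": "true",
                               "node_name": "node2:8983_solr"},
            }}}},
            "iris": {"shards": {"shard1": {"replicas": {
                "core_node1": {"state": "active", "leader": "true",
                               "node_name": "node1:8983_solr"},
            }}}},
        },
    }
}

if __name__ == "__main__":
    for collection, shard in find_leaderless_shards(SAMPLE):
        print(f"{collection}/{shard}: no live, active leader registered")
```

A shard that shows up here while its other replicas sit in "recovering" would be consistent with the "No registered leader" errors reported elsewhere in this thread, though the check only reads state and does not repair anything.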

Re: Node not recovering, leader elections not occurring

2016-07-19 Thread Tom Evans
On the nodes that have the replica in a recovering state we now see: 19-07-2016 16:18:28 ERROR RecoveryStrategy:159 - Error while trying to recover. core=lookups_shard1_replica8:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: lookups slic

Re: Node not recovering, leader elections not occurring

2016-07-19 Thread Tom Evans
There are 11 collections, each only has one shard, and each node has 10 replicas (9 collections are on every node, 2 are just on one node). We're not seeing any OOM errors on restart. I think we're being patient waiting for the leader election to occur. We stopped the troublesome "leader that is n

Re: Node not recovering, leader elections not occurring

2016-07-19 Thread Erick Erickson
How many replicas per Solr JVM? And do you see any OOM errors when you bounce a server? And how patient are you being, because it can take 3 minutes for a leaderless shard to decide it needs to elect a leader. See SOLR-7280 and SOLR-7191 for the case where lots of replicas are in the same JVM, the

Node not recovering, leader elections not occurring

2016-07-19 Thread Tom Evans
Hi all - problem with a SolrCloud 5.5.0, we have a node that has most of the collections on it marked as "Recovering" or "Recovery Failed". It attempts to recover from the leader, but the leader responds with: Error while trying to recover. core=iris_shard1_replica1:java.util.concurrent.ExecutionE