Found this error, which likely explains my issue with new replicas not coming up; I'm not sure of the next step. It almost looks like ZooKeeper's record of a shard's leader is not being updated?
4/8/2015, 4:56:03 PM ERROR ShardLeaderElectionContext There was a problem trying to register as the leader:org.apache.solr.common.SolrException: Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed

There was a problem trying to register as the leader:org.apache.solr.common.SolrException: Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed
    at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:150)
    at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:306)
    at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
    at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
    at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
    at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358)
    at org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:209)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /collections/kla_collection/leaders/shard4
    at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:40)
    at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:137)
    ... 11 more
Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /collections/kla_collection/leaders/shard4
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:462)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:74)
    at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:459)
    at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:416)
    at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:403)
    at org.apache.solr.cloud.ShardLeaderElectionContextBase$1.execute(ElectionContext.java:142)
    at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:34)

Matt

-----Original Message-----
From: Matt Kuiper [mailto:matt.kui...@issinc.com]
Sent: Wednesday, April 08, 2015 4:36 PM
To: solr-user@lucene.apache.org
Subject: RE: Clusterstate - state active

Erick, Anshum,

Thanks for your replies! Yes, it is replica state that I am looking at, and this is the answer I was hoping for.

I am working on a solution that involves moving some replicas to new Solr nodes as they are made available. Before deleting the original replicas backing the shard, I check the replica state to make sure it is active for the new replicas.

Initially it was working pretty well, but with more recent testing I regularly see the shard go down. The two new replicas go into a failed recovery state after the original replicas are deleted, and the logs report that a registered leader was not found for the shard. Initially I was concerned that maybe the new replicas were not fully synced with the leader, even though I checked for active state.
Now I am wondering if the new replicas are somehow competing (or somehow reluctant) to become leader, and thus neither becomes leader. I plan to test creating just one new replica on a new Solr node, checking that its state is active, then deleting the original replicas, and then creating the second new replica.

Any thoughts?

Matt

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, April 08, 2015 4:13 PM
To: solr-user@lucene.apache.org
Subject: Re: Clusterstate - state active

Matt:

In a word, "yes". Depending on the size of the index for that shard, the transition from Down->Recovering->Active may be too fast to catch. If replicating the index takes a while, though, you should at least see the "Recovering" state, during which time there won't be any searches forwarded to that node.

Best,
Erick

On Wed, Apr 8, 2015 at 2:58 PM, Matt Kuiper <matt.kui...@issinc.com> wrote:
> Hello,
>
> When creating a new replica, and the state is recorded as active within ZK
> clusterstate, does that mean that new replica has synched with the leader
> replica for the particular shard?
>
> Thanks,
> Matt
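
For reference, the check discussed in this thread (confirm the new replicas are active before deleting the originals, then confirm the shard still has a leader) could be sketched roughly as below. This is only an illustration, not a fix for the NodeExists error above. It assumes the cluster state has already been fetched from the Collections API (action=CLUSTERSTATUS) and parsed into a dict, and that replicas carry "state", "node_name" and "leader" fields as they do in clusterstate.json; please verify the field names against your Solr version. The helper names and the sample replica/node names are made up. One extra assumption worth stating: a replica marked "active" is only treated as usable here if its node also appears in live_nodes, since the recorded state can be stale when a node has gone away.

```python
# Sketch only: helpers that inspect an already-parsed CLUSTERSTATUS response.
# Field names ("state", "node_name", "leader", "live_nodes") are assumptions
# based on the clusterstate.json layout; check them against your Solr version.

def shard_replicas(status, collection, shard):
    """Return the {replica_name: info} mapping for one shard."""
    return status["cluster"]["collections"][collection]["shards"][shard]["replicas"]


def usable_replicas(status, collection, shard):
    """Names of replicas that are 'active' AND hosted on a live node."""
    live = set(status["cluster"].get("live_nodes", []))
    return {
        name
        for name, info in shard_replicas(status, collection, shard).items()
        if info.get("state") == "active" and info.get("node_name") in live
    }


def shard_leader(status, collection, shard):
    """Name of the replica flagged as leader, or None if none is flagged."""
    for name, info in shard_replicas(status, collection, shard).items():
        if info.get("leader") == "true":
            return name
    return None


def safe_to_delete(status, collection, shard, doomed):
    """True if at least one usable replica would remain after deleting `doomed`."""
    return bool(usable_replicas(status, collection, shard) - set(doomed))


# Hypothetical sample data: replica and node names are invented for illustration.
SAMPLE = {
    "cluster": {
        "live_nodes": ["old1:8983_solr", "old2:8983_solr", "new1:8983_solr"],
        "collections": {
            "kla_collection": {
                "shards": {
                    "shard4": {
                        "replicas": {
                            "core_node1": {"state": "active", "leader": "true",
                                           "node_name": "old1:8983_solr"},
                            "core_node2": {"state": "active",
                                           "node_name": "old2:8983_solr"},
                            "core_node9": {"state": "recovering",
                                           "node_name": "new1:8983_solr"},
                            # Marked active, but its node is not in live_nodes.
                            "core_node10": {"state": "active",
                                            "node_name": "new2:8983_solr"},
                        }
                    }
                }
            }
        },
    }
}

if __name__ == "__main__":
    originals = ["core_node1", "core_node2"]
    print("leader:", shard_leader(SAMPLE, "kla_collection", "shard4"))
    print("safe to delete originals:",
          safe_to_delete(SAMPLE, "kla_collection", "shard4", originals))
```

After the originals are deleted, the same status would need to be fetched again and shard_leader() polled until it returns one of the remaining replicas; if it keeps returning None, that matches the "registered leader was not found" symptom described above.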