Hi,

  I have a cluster with 5 Solr nodes (4.6 release) and 5 ZooKeeper (ZK)
nodes, with around 2000 collections (each with a single shard, each shard
having 1 or 2 replicas), running on Tomcat. Each Solr node hosts around
1000 physical cores.

  When starting any node, I almost always see errors like:

2013-12-19 18:45:42,454 [coreLoadExecutor-4-thread-721] ERROR org.apache.solr.cloud.ZkController- Error getting leader from zk
org.apache.solr.common.SolrException: Could not get leader props
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:945)
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:909)
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:873)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:807)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:757)
        at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:272)
        at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:489)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:272)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/core6_20131120/leaders/shard1
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:264)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:261)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)

  This happens for only some cores, usually about 10-20 out of the 1000
on a node, and different cores fail each time. These 10-20 cores are then
marked as "down" and never recover, while the other cores work fine.
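
  To see which cores are affected on a given start, the failing znode
paths can be pulled out of the log. This is only a minimal sketch: the
function name missing_leaders is made up for illustration, and the pattern
assumes the "KeeperErrorCode = NoNode for ..." message looks exactly like
the one in the trace above.

```python
import re

# Matches the ZooKeeper message seen in the trace above and captures
# the collection name and the shard name from the missing leader path.
NO_NODE = re.compile(
    r"KeeperErrorCode = NoNode for /collections/([^/\s]+)/leaders/(\S+)"
)


def missing_leaders(log_text):
    """Return sorted, de-duplicated (collection, shard) pairs whose leader
    znode was reported missing in the given log text."""
    return sorted(set(NO_NODE.findall(log_text)))
```

  Running it over the log of each restart would show whether the same
collections ever fail twice or whether the set really is different each
time.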

  I checked ZK: there really is no node
"/collections/core_20131120/leaders/shard1", but
"/collections/core_20131120/leaders" exists, so it looks like "shard1" was
removed (maybe during the previous shutdown?).
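
  The same check can be repeated for every collection rather than by hand.
The sketch below is an illustration, not anything Solr ships: leader_path
and shards_without_leader are hypothetical helper names, and the "exists"
argument is assumed to be any callable that takes a znode path and returns
True or False.

```python
def leader_path(collection, shard):
    """Build the leader znode path in the layout seen in the error above."""
    return f"/collections/{collection}/leaders/{shard}"


def shards_without_leader(shards, exists):
    """Return the (collection, shard) pairs that have no leader znode.

    shards: iterable of (collection, shard) pairs to check.
    exists: callable taking a znode path and returning True/False
            (assumed to be supplied by whatever ZK client is in use).
    """
    return [(c, s) for c, s in shards if not exists(leader_path(c, s))]
```

  With the Python kazoo client, for example, "exists" could be something
like lambda p: zk.exists(p) is not None, where zk is a connected
KazooClient.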

  Also, when I stop all nodes, clear the ZK state, and then start Solr (a
rolling start, one node at a time), all nodes start properly and all cores
are loaded properly ("active"). But after that, the first restart of any
Solr node causes issues on that node.

  Any ideas about the possible cause? And shouldn't Solr try to recover
from such a situation?

  Thanks,

  Bojan
