Hi,

I have a cluster with 5 Solr nodes (4.6 release) and 5 ZKs, running on Tomcat, with around 2000 collections (each with a single shard, each shard having 1 or 2 replicas). Each Solr node hosts around 1000 physical cores.

When starting any node, I almost always see errors like:

2013-12-19 18:45:42,454 [coreLoadExecutor-4-thread-721] ERROR org.apache.solr.cloud.ZkController- Error getting leader from zk
org.apache.solr.common.SolrException: Could not get leader props
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:945)
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:909)
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:873)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:807)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:757)
    at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:272)
    at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:489)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:272)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/core6_20131120/leaders/shard1
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:264)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:261)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)

It happens only for some cores, usually about 10-20 out of the 1000 on one node, and different cores fail each time. These 10-20 cores are then marked as "down" and never recover, while the other cores work fine.

I checked ZK: there really is no node "/collections/core_20131120/leaders/shard1", but "/collections/core_20131120/leaders" exists, so it looks like "shard1" was removed (maybe during the previous shutdown?).

Also, when I stop all nodes, clear the ZK state, and then start Solr again (a rolling start, one node at a time), all nodes start properly and all cores load properly ("active"). But after that, the first restart of any Solr node causes these issues on that node.

Any ideas about the possible cause? And shouldn't Solr try to recover from such a situation?

Thanks,
Bojan
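
P.S. In case it helps anyone compare which cores fail on each restart, here is a minimal sketch for listing the affected collections from a node's log. It assumes the log contains "NoNode for /collections/<name>/leaders/<shard>" messages like the one in the trace above. The function name and the sample text are hypothetical and not part of Solr or ZooKeeper.

```python
import re

# Matches the ZooKeeper "NoNode" message seen in the trace above and
# captures the collection and shard names from the leader znode path.
NO_LEADER = re.compile(
    r"NoNode for /collections/(?P<collection>[^/\s]+)/leaders/(?P<shard>[^/\s]+)"
)


def missing_leaders(log_text):
    """Return sorted, de-duplicated (collection, shard) pairs that hit NoNode."""
    return sorted(
        {(m.group("collection"), m.group("shard")) for m in NO_LEADER.finditer(log_text)}
    )


if __name__ == "__main__":
    # Sample line copied from the error above; a real run would read the log file.
    sample = (
        "Caused by: org.apache.zookeeper.KeeperException$NoNodeException: "
        "KeeperErrorCode = NoNode for /collections/core6_20131120/leaders/shard1\n"
    )
    for collection, shard in missing_leaders(sample):
        print(collection, shard)
```

Running this against the log after each restart and comparing the results would show whether the failing set really changes every time.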