I don't know that I've ever seen anyone test so many cores with SolrCloud. Perhaps there is a timeout that is too low, or ...
Can you file a JIRA issue? I can do some tests.

On Fri, Dec 20, 2013 at 11:22 AM, Bojan Šmid <bos...@gmail.com> wrote:

> Hi,
>
> I have a cluster with 5 Solr nodes (4.6 release) and 5 ZKs, with around
> 2000 collections (each with single shard, each shard having 1 or 2
> replicas), running on Tomcat. Each Solr node hosts around 1000 physical
> cores.
>
> When starting any node, I almost always see errors like:
>
> 2013-12-19 18:45:42,454 [coreLoadExecutor-4-thread-721] ERROR org.apache.solr.cloud.ZkController- Error getting leader from zk
> org.apache.solr.common.SolrException: Could not get leader props
>     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:945)
>     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:909)
>     at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:873)
>     at org.apache.solr.cloud.ZkController.register(ZkController.java:807)
>     at org.apache.solr.cloud.ZkController.register(ZkController.java:757)
>     at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:272)
>     at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:489)
>     at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:272)
>     at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:722)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/core6_20131120/leaders/shard1
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
>     at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:264)
>     at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:261)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
>
> It happens just for some cores, usually for about 10-20 of them out of
> 1000 on one node (each time different cores fail). These 10-20 cores are
> then marked as "down" and they are never "recovered", while other cores
> work ok.
>
> I did check ZK, there really is no node
> "/collections/core_20131120/leaders/shard1", but
> "/collections/core_20131120/leaders" exists, so it looks like "shard1" was
> removed (maybe during previous shutdown?).
>
> Also, when I stop all nodes and clear ZK state, and after that start Solr
> (rolling starting nodes one by one), all nodes start properly and all cores
> are properly loaded ("active"). But after that, first restart of any Solr
> node causes issues on that node.
>
> Any ideas about possible cause? And shouldn't Solr maybe try to recover
> from such situation?
>
> Thanks,
>
> Bojan

--
- Mark
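
For a JIRA report it may help to list exactly which shards hit the NoNode error on a given restart, since the quoted message says the failing cores differ each time. Below is a minimal sketch in Java that scans a Solr log for those errors and prints each missing leader znode path once. It assumes each error sits on a single log line in the "KeeperErrorCode = NoNode for /collections/<name>/leaders/<shard>" form shown in the trace above. The class and method names are hypothetical and not part of Solr or ZooKeeper.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Solr or ZooKeeper.
class MissingLeaderScan {

    // Captures the leader znode path that follows "NoNode for" in a log line.
    // Assumes the path has the /collections/<name>/leaders/<shard> layout.
    private static final Pattern NO_NODE_LEADER = Pattern.compile(
            "NoNode for (/collections/[^/\\s]+/leaders/[^/\\s]+)");

    // Returns each distinct missing leader path once, in first-seen order.
    static List<String> missingLeaderPaths(List<String> logLines) {
        Set<String> paths = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = NO_NODE_LEADER.matcher(line);
            if (m.find()) {
                paths.add(m.group(1));
            }
        }
        return new ArrayList<>(paths);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MissingLeaderScan <solr-log-file>");
            return;
        }
        // Print one missing leader path per line.
        for (String path : missingLeaderPaths(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(path);
        }
    }
}
```

Running it against the log from each restart and comparing the output would show whether the same collections ever fail twice.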