Re: SolrCloud - "KeeperErrorCode = NoNode" - after restart

Mark Miller Sun, 22 Dec 2013 22:23:56 -0800

I don't know that I've ever seen anyone test so many cores with SolrCloud.
Perhaps there is a timeout that is too low, or ...


Can you file a JIRA issue? I can do some tests.


On Fri, Dec 20, 2013 at 11:22 AM, Bojan Šmid <bos...@gmail.com> wrote:

> Hi,
>
>   I have a cluster with 5 Solr nodes (4.6 release) and 5 ZKs, with around
> 2000 collections (each with a single shard, each shard having 1 or 2
> replicas), running on Tomcat. Each Solr node hosts around 1000 physical
> cores.
>
>   When starting any node, I almost always see errors like:
>
> 2013-12-19 18:45:42,454 [coreLoadExecutor-4-thread-721] ERROR org.apache.solr.cloud.ZkController- Error getting leader from zk
> org.apache.solr.common.SolrException: Could not get leader props
>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:945)
>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:909)
>         at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:873)
>         at org.apache.solr.cloud.ZkController.register(ZkController.java:807)
>         at org.apache.solr.cloud.ZkController.register(ZkController.java:757)
>         at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:272)
>         at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:489)
>         at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:272)
>         at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:722)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/core6_20131120/leaders/shard1
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:264)
>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:261)
>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
>
>   It happens for only some cores, usually about 10-20 of them out of
> 1000 on one node (different cores fail each time). These 10-20 cores are
> then marked as "down" and are never "recovered", while the other cores
> work fine.
>
>   I did check ZK: there really is no node
> "/collections/core_20131120/leaders/shard1", but
> "/collections/core_20131120/leaders" exists, so it looks like "shard1" was
> removed (maybe during the previous shutdown?).
>
>   Also, when I stop all nodes, clear the ZK state, and then start Solr
> (starting the nodes one by one), all nodes start properly and all cores
> are properly loaded ("active"). But after that, the first restart of any
> Solr node causes issues on that node.
>
>   Any ideas about the possible cause? And shouldn't Solr try to recover
> from such a situation?
>
>   Thanks,
>
>   Bojan
>
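
Since only 10-20 cores out of 1000 fail on each restart, and different
ones each time, it may help to pull the affected collection and shard
names out of the Solr log. The following is a minimal sketch, not part of
Solr: the class name NoNodeLogScanner and the method missingLeaders are
hypothetical, and the pattern assumes the message format shown in the
trace above ("KeeperErrorCode = NoNode for /collections/.../leaders/...").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Solr class): lists the leader nodes that
// NoNode errors in a log text refer to.
final class NoNodeLogScanner {
    // Assumed message format, taken from the trace quoted above:
    // "KeeperErrorCode = NoNode for /collections/<collection>/leaders/<shard>"
    private static final Pattern NO_NODE = Pattern.compile(
        "KeeperErrorCode = NoNode for /collections/([^/\\s]+)/leaders/([^/\\s]+)");

    // Returns "collection/shard" for each missing leader node found,
    // in order of first appearance and without duplicates.
    static List<String> missingLeaders(String logText) {
        List<String> result = new ArrayList<>();
        Matcher m = NO_NODE.matcher(logText);
        while (m.find()) {
            String entry = m.group(1) + "/" + m.group(2);
            if (!result.contains(entry)) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample =
            "Caused by: org.apache.zookeeper.KeeperException$NoNodeException: "
            + "KeeperErrorCode = NoNode for /collections/core6_20131120/leaders/shard1";
        System.out.println(missingLeaders(sample));
    }
}
```

Comparing the resulting list across several restarts would show whether
the same collections ever fail twice.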



-- 
- Mark
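
The "timeout that is too low" idea can be illustrated generically: poll
for a condition until a deadline, rather than failing on the first miss.
This is only a sketch and not Solr's actual code; LeaderWait and waitFor
are hypothetical names, and the existence check is passed in as a plain
BooleanSupplier. In a real client that check might wrap the ZooKeeper
API's exists(path, false) call, which returns null when the node is
absent.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper (not a Solr class): waits for a condition such as
// "the leader node exists" to become true, up to a timeout.
final class LeaderWait {
    // Polls `check` every pollMs milliseconds until it returns true or
    // timeoutMs has elapsed. Returns true on success, false on timeout
    // or if the waiting thread is interrupted.
    static boolean waitFor(BooleanSupplier check, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            if (check.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }
}
```

With around 1000 cores registering at startup, the timeout and poll
interval would be the values to experiment with.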
