Hi Jeff,

Leader election relies on ephemeral nodes in ZooKeeper to detect when the leader or other nodes have gone down abruptly. These ephemeral nodes are automatically deleted by ZooKeeper after the ZK session timeout, which is 30 seconds by default. So if you kill a node, it can take up to 30 seconds for the cluster to detect the failure and start a new leader election. This is not necessary during a graceful shutdown, because on shutdown the node gives up its leader position so that a new leader can be elected immediately.

You could tune the ZK session timeout to a lower value, but that makes the cluster more sensitive to GC pauses, which can also trigger new leader elections.
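For reference, the session timeout is controlled by zkClientTimeout, which can be set in the solrcloud section of solr.xml (this shape is from later Solr versions and may differ in 4.7, so treat it as a sketch rather than a drop-in config):

```xml
<solr>
  <solrcloud>
    <!-- ZK session timeout in milliseconds; default is 30000 (30 seconds).
         Lowering it speeds up dead-node detection, but long GC pauses can
         then exceed the timeout and trigger spurious leader elections. -->
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  </solrcloud>
</solr>
```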
On Mon, Sep 21, 2015 at 5:55 PM, Jeff Wu <wuhai...@gmail.com> wrote:

> Our environment still runs Solr 4.7. Recently we noticed in a test: when
> we stopped one Solr server (solr02, via an OS shutdown), all the cores of
> solr02 were shown as "down", but a few cores still remained leaders. After
> that, all the other servers kept sending requests to the downed server,
> and we therefore saw many TCP waiting threads in the thread pools of the
> other Solr servers, since solr02 was already down.
>
> "shard53":{
>   "range":"26660000-2998ffff",
>   "state":"active",
>   "replicas":{
>     "core_node102":{
>       "state":"down",
>       "base_url":"https://solr02.myhost/solr",
>       "core":"collection2_shard53_replica1",
>       "node_name":"https://solr02.myhost_solr",
>       "leader":"true"},
>     "core_node104":{
>       "state":"active",
>       "base_url":"https://solr04.myhost/solr",
>       "core":"collection2_shard53_replica2",
>       "node_name":"https://solr04.myhost/solr_solr"}}},
>
> Is this a known bug in 4.7 that was fixed later? Is there a reference JIRA
> issue we can study? If the Solr service is stopped gracefully, we see the
> leader election happen and the leader role switch to another active core.
> But if we shut down the Solr host's OS directly, we can reproduce in our
> environment that some "down" cores remain "leader" in the ZK
> clusterstate.json.

-- 
Regards,
Shalin Shekhar Mangar.