Happens to us too. Solr 4.7.2

On Sep 21, 2015 20:42, "Jeff Wu" <wuhai...@gmail.com> wrote:
> Hi Shai, still the same question: the other peer cores, which are active,
> did not claim to be leader even after a long time. However, some of the
> peer cores did claim to be leaders earlier, while the server was stopping.
> Those results are inconsistent.
>
> 2015-09-21 10:52 GMT-04:00 Shai Erera <ser...@gmail.com>:
>
> > I don't think the process Shalin describes applies to clusterstate.json.
> > That JSON object reflects the status Solr "knows" about, or "last known
> > status". When Solr is properly shut down, I believe those attributes are
> > cleared from clusterstate.json, and the leaders give up their lease as
> > well.
> >
> > However, when Solr is killed, it takes ZK the 30 seconds or so timeout to
> > kill the ephemeral node and release the leader lease. ZK is unaware of
> > Solr's clusterstate.json and cannot update the 'leader' property to
> > false. It simply releases the lease, so that other cores may claim it.
> >
> > Perhaps that explains the confusion?
> >
> > Shai
> >
> > On Mon, Sep 21, 2015 at 4:36 PM, Jeff Wu <wuhai...@gmail.com> wrote:
> >
> > > Hi Shalin, thank you for the response.
> > >
> > > We waited longer than the ZK session timeout, and it still did not
> > > kick off any leader election for these "remained down-leader" cores.
> > > That's the question I'm actually asking.
> > >
> > > Our test scenario:
> > >
> > > Each Solr server has 64 cores, and they are all active, and all leader
> > > cores.
> > > Shut down the Linux OS.
> > > Monitor clusterstate.json in ZK after waiting longer than the ZK
> > > session timeout.
> > > We noticed that leader election happened for some cores, but we still
> > > saw some down cores remaining leader.
> > >
> > > 2015-09-21 9:15 GMT-04:00 Shalin Shekhar Mangar <shalinman...@gmail.com>:
> > >
> > > > Hi Jeff,
> > > >
> > > > The leader election relies on ephemeral nodes in ZooKeeper to detect
> > > > when the leader or other nodes have gone down (abruptly).
> > > > These ephemeral nodes are automatically deleted by ZooKeeper after
> > > > the ZK session timeout, which is 30 seconds by default. So if you
> > > > kill a node, it can take up to 30 seconds for the cluster to detect
> > > > it and start a new leader election. This won't be necessary during a
> > > > graceful shutdown, because on shutdown the node will give up the
> > > > leader position so that a new one can be elected. You could tune the
> > > > ZK session timeout to a lower value, but that makes the cluster more
> > > > sensitive to GC pauses, which can also trigger new leader elections.
> > > >
> > > > On Mon, Sep 21, 2015 at 5:55 PM, Jeff Wu <wuhai...@gmail.com> wrote:
> > > > > Our environment still runs Solr 4.7. Recently we noticed the
> > > > > following in a test. When we stopped one Solr server (solr02, via
> > > > > an OS shutdown), all the cores of solr02 were shown as "down", but
> > > > > a few of them still remained leaders. After that, we quickly saw
> > > > > that all the other servers were still sending requests to the down
> > > > > Solr server, and therefore we saw a lot of threads waiting on TCP
> > > > > in the thread pools of the other Solr servers, since solr02 was
> > > > > already down.
> > > > >
> > > > > "shard53":{
> > > > >   "range":"26660000-2998ffff",
> > > > >   "state":"active",
> > > > >   "replicas":{
> > > > >     "core_node102":{
> > > > >       "state":"down",
> > > > >       "base_url":"https://solr02.myhost/solr",
> > > > >       "core":"collection2_shard53_replica1",
> > > > >       "node_name":"https://solr02.myhost_solr",
> > > > >       "leader":"true"},
> > > > >     "core_node104":{
> > > > >       "state":"active",
> > > > >       "base_url":"https://solr04.myhost/solr",
> > > > >       "core":"collection2_shard53_replica2",
> > > > >       "node_name":"https://solr04.myhost/solr_solr"}}},
> > > > >
> > > > > Is this a known bug in 4.7 that was fixed later on? Is there any
> > > > > JIRA we can study for reference?
> > > > > If the Solr service is stopped gracefully, we can see leader
> > > > > election happen and leadership switch to another active core. But
> > > > > if we shut down the OS of a Solr server directly, we can reproduce
> > > > > in our environment that some "down" cores remain "leader" in the
> > > > > ZK clusterstate.json.
> > > >
> > > > --
> > > > Regards,
> > > > Shalin Shekhar Mangar.
>
> --
> Jeff Wu
> ---------------------------
> CSDL Beijing, China
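
For anyone trying to confirm the same symptom, here is a minimal Python sketch that scans a saved copy of clusterstate.json for replicas that are still flagged "leader":"true" while their state is not "active". It is only an illustration under assumptions: the function name find_stale_leaders is made up for this example, and the outer "collection2" / "shards" wrapper in SAMPLE is an assumed layout (the collection name is inferred from the core names). Only the shard53 fragment itself comes from the message quoted above.

```python
import json

# SAMPLE wraps the shard53 fragment quoted in the thread. The outer
# "collection2" and "shards" keys are an assumed layout, not from the thread.
SAMPLE = json.loads("""
{"collection2": {"shards": {"shard53": {
  "range": "26660000-2998ffff",
  "state": "active",
  "replicas": {
    "core_node102": {
      "state": "down",
      "base_url": "https://solr02.myhost/solr",
      "core": "collection2_shard53_replica1",
      "node_name": "https://solr02.myhost_solr",
      "leader": "true"},
    "core_node104": {
      "state": "active",
      "base_url": "https://solr04.myhost/solr",
      "core": "collection2_shard53_replica2",
      "node_name": "https://solr04.myhost/solr_solr"}}}}}}
""")


def find_stale_leaders(clusterstate):
    """Return (collection, shard, replica, node_name) tuples for replicas
    that carry "leader":"true" but whose state is not "active"."""
    stale = []
    for coll_name, coll in clusterstate.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for rep_name, rep in shard.get("replicas", {}).items():
                # The thread's JSON stores the flag as the string "true".
                is_leader = rep.get("leader") == "true"
                if is_leader and rep.get("state") != "active":
                    stale.append(
                        (coll_name, shard_name, rep_name, rep.get("node_name"))
                    )
    return stale


if __name__ == "__main__":
    for entry in find_stale_leaders(SAMPLE):
        print("stale leader flag:", *entry)
```

To run it against a real cluster, replace SAMPLE with the parsed contents of a clusterstate.json file you have exported from ZooKeeper. Note that this only reports what clusterstate.json says; as Shai points out, that file is Solr's last known status and is not by itself proof of who holds the leader lease in ZK.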