Re: solr4.7: leader core is not re-elected to another active core after Solr OS shutdown, known issue?

Gili Nachum Mon, 21 Sep 2015 11:37:52 -0700

Happens to us too. Solr 4.7.2
On Sep 21, 2015 20:42, "Jeff Wu" <wuhai...@gmail.com> wrote:


> Hi Shai, still the same question: other peer cores that are active did not
> claim leadership even after a long time. However, some of the peer cores
> did claim leadership earlier, when the server was stopping. Those results
> are inconsistent.
>
> 2015-09-21 10:52 GMT-04:00 Shai Erera <ser...@gmail.com>:
>
> > I don't think the process Shalin describes applies to clusterstate.json.
> > That JSON object reflects the status Solr "knows" about, or "last known
> > status". When Solr is properly shut down, I believe those attributes are
> > cleared from clusterstate.json, and the leaders give up their lease as well.
> >
> > However, when Solr is killed, it takes ZK the 30 seconds or so timeout to
> > kill the ephemeral node and release the leader lease. ZK is unaware of
> > Solr's clusterstate.json and cannot update the 'leader' property to false.
> > It simply releases the lease, so that other cores may claim it.
> >
> > Perhaps that explains the confusion?
> >
> > Shai
> >
> > On Mon, Sep 21, 2015 at 4:36 PM, Jeff Wu <wuhai...@gmail.com> wrote:
> >
> > > Hi Shalin,  thank you for the response.
> > >
> > > We waited longer than the ZK session timeout, and it still did not
> > > kick off any leader election for these "remaining down-leader" cores.
> > > That's the question I'm actually asking.
> > >
> > > Our test scenario:
> > >
> > > Each Solr server has 64 cores, all active and all leader cores.
> > > Shut down the Linux OS.
> > > Monitor clusterstate.json in ZK until well past the ZK session timeout.
> > > We noticed that leader election happened for some cores, but some down
> > > cores still remained leader.
> > >
> > > 2015-09-21 9:15 GMT-04:00 Shalin Shekhar Mangar <shalinman...@gmail.com>:
> > >
> > > > Hi Jeff,
> > > >
> > > > The leader election relies on ephemeral nodes in ZooKeeper to detect
> > > > when the leader or other nodes have gone down (abruptly). These ephemeral
> > > > nodes are automatically deleted by ZooKeeper after the ZK session
> > > > timeout, which is 30 seconds by default. So if you kill a node, it
> > > > can take up to 30 seconds for the cluster to detect it and start a new
> > > > leader election. This won't be necessary during a graceful shutdown
> > > > because on shutdown the node will give up the leader position so that a
> > > > new one can be elected. You could tune the ZK session timeout to a
> > > > lower value, but that makes the cluster more sensitive to GC pauses,
> > > > which can also trigger new leader elections.
> > > >
> > > > On Mon, Sep 21, 2015 at 5:55 PM, Jeff Wu <wuhai...@gmail.com> wrote:
> > > > > Our environment still runs Solr 4.7. Recently we noticed the following
> > > > > in a test. When we stopped one Solr server (solr02, via OS shutdown),
> > > > > all the cores of solr02 were shown as "down", but a few of them still
> > > > > remained leaders. After that, we quickly saw that all the other servers
> > > > > were still sending requests to the down Solr server, and so we saw a
> > > > > lot of threads waiting on TCP in the thread pools of the other Solr
> > > > > servers, since solr02 was already down.
> > > > >
> > > > > "shard53":{
> > > > >         "range":"26660000-2998ffff",
> > > > >         "state":"active",
> > > > >         "replicas":{
> > > > >           "core_node102":{
> > > > >             "state":"down",
> > > > >             "base_url":"https://solr02.myhost/solr",
> > > > >             "core":"collection2_shard53_replica1",
> > > > >             "node_name":"https://solr02.myhost_solr",
> > > > >             "leader":"true"},
> > > > >           "core_node104":{
> > > > >             "state":"active",
> > > > >             "base_url":"https://solr04.myhost/solr",
> > > > >             "core":"collection2_shard53_replica2",
> > > > >             "node_name":"https://solr04.myhost/solr_solr"}}},
> > > > >
> > > > > Is this a known bug in 4.7 that was fixed later? Is there a JIRA
> > > > > we can study? If the Solr service is stopped gracefully, we can see
> > > > > leader election happen and leadership switch to another active core.
> > > > > But if we just shut down the OS of a Solr server directly, we can
> > > > > reproduce in our environment that some "down" cores remain "leader"
> > > > > in the ZK clusterstate.json.
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Shalin Shekhar Mangar.
> > > >
> > >
> >
>
>
>
> --
> Jeff Wu
> ---------------------------
> CSDL Beijing, China
>
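
For anyone trying to spot the condition described in the thread (a replica
marked "down" that still carries "leader":"true"), here is a minimal Python
sketch that scans a clusterstate.json document for such replicas. It is only
an illustration, not anything shipped with Solr: the function name
stale_leaders is made up, and the outer "collection2" / "shards" wrapper
around the quoted shard53 entry is an assumption about the 4.x layout, since
the thread only shows the shard entry itself. The replica values are copied
from the snippet above.

```python
import json

# Sample built from the shard53 entry quoted in this thread.
# ASSUMPTION: the enclosing {"collection2": {"shards": {...}}} wrapper is a
# guess at the surrounding layout; only the shard entry appears in the thread.
SAMPLE = """
{"collection2": {"shards": {"shard53": {
  "range": "26660000-2998ffff",
  "state": "active",
  "replicas": {
    "core_node102": {
      "state": "down",
      "base_url": "https://solr02.myhost/solr",
      "core": "collection2_shard53_replica1",
      "node_name": "https://solr02.myhost_solr",
      "leader": "true"},
    "core_node104": {
      "state": "active",
      "base_url": "https://solr04.myhost/solr",
      "core": "collection2_shard53_replica2",
      "node_name": "https://solr04.myhost/solr_solr"}}}}}}
"""


def stale_leaders(cluster_state):
    """Return (collection, shard, replica, node_name) tuples for every
    replica that is flagged as leader but whose state is not "active"."""
    stale = []
    for coll_name, coll in cluster_state.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for rep_name, rep in shard.get("replicas", {}).items():
                # The thread shows the flag stored as the string "true".
                is_leader = rep.get("leader") == "true"
                if is_leader and rep.get("state") != "active":
                    stale.append(
                        (coll_name, shard_name, rep_name, rep.get("node_name"))
                    )
    return stale


if __name__ == "__main__":
    for entry in stale_leaders(json.loads(SAMPLE)):
        print("stale leader:", entry)
```

In a real check the JSON would be read from the clusterstate.json node in
ZooKeeper with whatever ZK client is at hand, rather than from an inline
string. Comparing the reported node_name values against the current live
nodes would be a natural next step, but that is left out of this sketch.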
