Yes, it's the intended behavior. The whole point of the
onlyIfDown flag is to act as a safety valve for those
who want to be cautious and guard against typos
and the like.

If you specify onlyIfDown=false and the replica still
isn't removed from ZK, then something is wrong.
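
For instance (untested sketch; the host and port are placeholders, and the
collection/shard/replica names are just the ones from your error message),
a plain DELETEREPLICA call without onlyIfDown should remove the entry even
though state.json still says "active":

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class DeleteDeadReplica {
    public static void main(String[] args) throws Exception {
      // No onlyIfDown parameter, so no state check is applied.
      String url = "http://localhost:8983/solr/admin/collections"
          + "?action=DELETEREPLICA"
          + "&collection=demo.public.tbl"
          + "&shard=shard0"
          + "&replica=core_node4"
          + "&wt=json";
      HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line);  // echo Solr's JSON response
        }
      }
    }
  }

The same parameters sent from curl or a browser do the same thing; the point
is simply to leave onlyIfDown off.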

Best,
Erick

On Tue, Jul 19, 2016 at 10:41 PM, Jerome Yang <jey...@pivotal.io> wrote:
> What I'm doing is simulating a host-crash situation.
>
> Consider this: a host is not connected to the cluster.
>
> So if a host crashes, I cannot delete the down replicas by using
> onlyIfDown='true'.
> But in the Solr admin UI, these replicas show as down.
> And without "onlyIfDown", it still shows a failure:
> Delete replica failed: Attempted to remove replica :
> demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is
> 'active'.
>
> Is this the right behavior? If a host is gone, I cannot delete the replicas
> on that host?
>
> Regards,
> Jerome
>
> On Wed, Jul 20, 2016 at 1:58 AM, Justin Lee <lee.justi...@gmail.com> wrote:
>
>> Thanks for taking the time for the detailed response. I completely get what
>> you are saying. Makes sense.
>> On Tue, Jul 19, 2016 at 10:56 AM Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>> > Justin:
>> >
>> > Well, "kill -9" just makes it harder. The original question
>> > was whether a replica showing "active" was a bug, and it's
>> > not when you kill -9; the Solr node has no chance to
>> > tell Zookeeper it's going away. ZK does remove the node
>> > from live_nodes by itself, so wherever a replica's state
>> > is referenced there are checks, as necessary, that the
>> > node is also in live_nodes. And the overwhelming majority
>> > of the time this is OK; Solr recovers just fine.
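>> >
>> > If you want to double-check that from the client side, here is a rough
>> > SolrJ sketch (untested, written against the 5.x SolrJ API; the ZK
>> > address is a placeholder and the collection name is the one from this
>> > thread). It cross-checks each replica's recorded state against
>> > live_nodes, the same check the Replica.State javadoc quoted later in
>> > this thread recommends:
>> >
>> >   import org.apache.solr.client.solrj.impl.CloudSolrClient;
>> >   import org.apache.solr.common.cloud.ClusterState;
>> >   import org.apache.solr.common.cloud.Replica;
>> >   import org.apache.solr.common.cloud.Slice;
>> >
>> >   public class CheckTrulyActive {
>> >     public static void main(String[] args) throws Exception {
>> >       // Placeholder ZK connect string; use your ensemble's address.
>> >       try (CloudSolrClient client =
>> >           new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
>> >         client.connect();
>> >         ClusterState cs = client.getZkStateReader().getClusterState();
>> >         for (Slice slice : cs.getCollection("demo.public.tbl").getSlices()) {
>> >           for (Replica r : slice.getReplicas()) {
>> >             boolean saysActive = r.getState() == Replica.State.ACTIVE;
>> >             boolean nodeLive = cs.liveNodesContain(r.getNodeName());
>> >             // Only "active" plus a live node means the replica is really usable.
>> >             System.out.println(slice.getName() + "/" + r.getName()
>> >                 + " state=" + r.getState()
>> >                 + " live=" + nodeLive
>> >                 + " trulyActive=" + (saysActive && nodeLive));
>> >           }
>> >         }
>> >       }
>> >     }
>> >   }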
>> >
>> > As far as the write locks are concerned, those are
>> > a Lucene-level issue, so if you kill Solr at just the
>> > wrong time it's possible that one will be left over. The
>> > write locks are held for as short a period as possible
>> > by Lucene, but occasionally they can linger if you kill
>> > -9.
>> >
>> > When a replica comes up, if there is a write lock already, it
>> > doesn't just take over; it fails to load instead.
>> >
>> > A kill -9 won't bring the cluster down by itself unless
>> > several coincidences line up. Just don't make
>> > it a habit. For instance, consider killing -9 on
>> > two Solrs that happen to contain all of the replicas
>> > for shard1 of collection1, and suppose you _happen_ to
>> > kill them both at just the wrong time so they both
>> > leave Lucene write locks behind for those replicas. Now
>> > no replica will come up for shard1 and the collection
>> > is unusable.
>> >
>> > So the shorter form is that using "kill -9" is a poor practice
>> > that exposes you to some risk. The hard-core Solr
>> > guys work extremely hard to compensate for this kind
>> > of thing, but kill -9 is a harsh, last-resort option and
>> > shouldn't be part of your regular process. You should
>> > expect some "interesting" states when you do use it, and
>> > you should use the bin/solr script to stop Solr
>> > gracefully instead.
>> >
>> > Best,
>> > Erick
>> >
>> >
>> > On Tue, Jul 19, 2016 at 9:29 AM, Justin Lee <lee.justi...@gmail.com>
>> > wrote:
>> > > Pardon me for hijacking the thread, but I'm curious about something you
>> > > said, Erick.  I always thought that the point (in part) of going through
>> > > the pain of using zookeeper and creating replicas was so that the system
>> > > could seamlessly recover from catastrophic failures.  Wouldn't an OOM
>> > > condition have a similar effect (or maybe java is better at cleanup on
>> > > that kind of error)?  The reason I ask is that I'm trying to set up a solr
>> > > system that is highly available and I'm a little bit surprised that a
>> > > kill -9 on one process on one machine could put the entire system in a bad
>> > > state.  Is it common to have to address problems like this with manual
>> > > intervention in production systems?  Ideally, I'd hope to be able to set
>> > > up a system where a single node dying a horrible death would never require
>> > > intervention.
>> > >
>> > > On Tue, Jul 19, 2016 at 8:54 AM Erick Erickson <erickerick...@gmail.com> wrote:
>> > >
>> > >> First of all, killing with -9 is A Very Bad Idea. You can
>> > >> leave write lock files lying around. You can leave
>> > >> the state in an "interesting" place. You haven't given
>> > >> Solr a chance to tell Zookeeper that it's going away
>> > >> (which would set the state to "down"). In short,
>> > >> when you do this you have to deal with the consequences
>> > >> yourself, one of which is this mismatch between
>> > >> cluster state and live_nodes.
>> > >>
>> > >> Now, that rant done, the bin/solr script tries to stop Solr
>> > >> gracefully but issues a kill if Solr doesn't stop nicely. Personally
>> > >> I think that timeout should be longer, but that's another story.
>> > >>
>> > >> The onlyIfDown='true' option is there specifically as a
>> > >> safety valve. It was provided for those who want to guard against
>> > >> typos and the like, so just don't specify it and you should be fine.
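>> > >>
>> > >> Conceptually, the flag just adds a precondition. A very rough sketch
>> > >> of the semantics (illustration only, not the actual Solr code):
>> > >>
>> > >>   import org.apache.solr.common.cloud.Replica;
>> > >>
>> > >>   // Sketch of the onlyIfDown semantics only, not the real implementation.
>> > >>   public class OnlyIfDownGuard {
>> > >>     static void deleteReplica(Replica.State stateInZk, boolean onlyIfDown) {
>> > >>       if (onlyIfDown && stateInZk != Replica.State.DOWN) {
>> > >>         // This is the refusal from the original post: after kill -9,
>> > >>         // state.json may still say ACTIVE, so the guarded form bails out.
>> > >>         throw new IllegalStateException(
>> > >>             "Attempted to remove replica with onlyIfDown='true', but state is '"
>> > >>                 + stateInZk + "'");
>> > >>       }
>> > >>       // Without the flag the request falls through and the replica
>> > >>       // entry is removed from the cluster state.
>> > >>       System.out.println("replica removed");
>> > >>     }
>> > >>
>> > >>     public static void main(String[] args) {
>> > >>       deleteReplica(Replica.State.ACTIVE, false);  // proceeds
>> > >>       deleteReplica(Replica.State.ACTIVE, true);   // throws, as above
>> > >>     }
>> > >>   }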
>> > >>
>> > >> Best,
>> > >> Erick
>> > >>
>> > >> On Mon, Jul 18, 2016 at 11:51 PM, Jerome Yang <jey...@pivotal.io> wrote:
>> > >> > Hi all,
>> > >> >
>> > >> > Here's the situation.
>> > >> > I'm using Solr 5.3 in cloud mode.
>> > >> >
>> > >> > I have 4 nodes.
>> > >> >
>> > >> > After using "kill -9 pid-solr-node" to kill 2 of the nodes, the
>> > >> > replicas on those two nodes still show as "ACTIVE" in ZooKeeper's
>> > >> > state.json.
>> > >> >
>> > >> > The problem is, when I try to delete these down replicas with the
>> > >> > parameter onlyIfDown='true', it says:
>> > >> > "Delete replica failed: Attempted to remove replica :
>> > >> > demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is
>> > >> > 'active'."
>> > >> >
>> > >> > From this link:
>> > >> > http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE
>> > >> >
>> > >> > It says:
>> > >> > *NOTE*: when the node the replica is hosted on crashes, the replica's
>> > >> > state may remain ACTIVE in ZK. To determine if the replica is truly
>> > >> > active, you must also verify that its node
>> > >> > (http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.html#getNodeName--)
>> > >> > is under /live_nodes in ZK (or use ClusterState.liveNodesContain(String),
>> > >> > http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/ClusterState.html#liveNodesContain-java.lang.String-).
>> > >> >
>> > >> > So, is this a bug?
>> > >> >
>> > >> > Regards,
>> > >> > Jerome
>> > >>
>> >
>>
