Justin:

Well, "kill -9" just makes it harder. The original question
was whether a replica being "active" was a bug, and it's
not when you kill -9; the Solr node has no chance to
tell Zookeeper it's going away. ZK does modify
the live_nodes by itself, thus there are checks as
necessary when a replica's state is referenced
whether the node is also in live_nodes. And an
overwhelming amount of the time this is OK, Solr
recovers just fine.
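
If you want to script that check yourself, here's a rough
sketch using SolrJ. It's not anything Solr runs internally;
the ZK address, collection name, and class name are placeholders
of mine that you'd swap for your own:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class TrulyActiveCheck {
  public static void main(String[] args) throws Exception {
    // ZK address and collection name are placeholders.
    try (CloudSolrClient client = new CloudSolrClient("localhost:2181")) {
      client.connect();
      ClusterState clusterState =
          client.getZkStateReader().getClusterState();
      DocCollection coll = clusterState.getCollection("collection1");
      for (Slice slice : coll.getSlices()) {
        for (Replica replica : slice.getReplicas()) {
          // "active" in state.json is only half the story...
          boolean stateActive =
              replica.getState() == Replica.State.ACTIVE;
          // ...the node must also still be under /live_nodes.
          boolean nodeLive =
              clusterState.liveNodesContain(replica.getNodeName());
          System.out.println(replica.getName()
              + " trulyActive=" + (stateActive && nodeLive));
        }
      }
    }
  }
}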

As far as write locks are concerned, those are a
Lucene-level issue, so if you kill Solr at just the
wrong time one can be left behind. Lucene holds the
write lock for as short a period as possible, but
occasionally it can linger after a kill -9.

When a replica comes up, if there is a write lock already, it
doesn't just take over; it fails to load instead.
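
Here's a tiny standalone illustration of that behavior at the
Lucene level. It's just a sketch, not Solr's actual core-loading
code, and the index path is made up: a second IndexWriter on a
locked directory refuses to open rather than stealing the lock.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

import java.nio.file.Paths;

public class WriteLockDemo {
  public static void main(String[] args) throws Exception {
    // Placeholder path; in Solr this would be a core's index directory.
    try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/writelock-demo"));
         // The first writer holds write.lock, standing in for the process
         // that owned the index before.
         IndexWriter first =
             new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      try {
        // A second writer on the same directory is roughly what a restarting
        // replica tries to become; it fails instead of taking the lock over.
        new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
      } catch (LockObtainFailedException e) {
        System.out.println("Index locked, core would fail to load: " + e);
      }
    }
  }
}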

A kill -9 won't bring the cluster down by itself unless
several coincidences line up, but don't make it a habit.
For instance, say you kill -9 two Solr nodes that happen
to host all of the replicas for shard1 of collection1,
and you _happen_ to kill them both at just the wrong
time so that they both leave Lucene write locks behind.
Now no replica for shard1 can come up and the collection
is unusable.

So the short form is that using "kill -9" is a poor practice
that exposes you to some risk. The hard-core Solr
guys work extremely hard to compensate for this kind
of thing, but kill -9 is a harsh, last-resort option and
shouldn't be part of your regular process; expect some
"interesting" states when you use it. Use the bin/solr
script to stop Solr gracefully instead.

Best,
Erick


On Tue, Jul 19, 2016 at 9:29 AM, Justin Lee <lee.justi...@gmail.com> wrote:
> Pardon me for hijacking the thread, but I'm curious about something you
> said, Erick.  I always thought that the point (in part) of going through
> the pain of using zookeeper and creating replicas was so that the system
> could seamlessly recover from catastrophic failures.  Wouldn't an OOM
> condition have a similar effect (or maybe java is better at cleanup on that
> kind of error)?  The reason I ask is that I'm trying to set up a solr
> system that is highly available and I'm a little bit surprised that a kill
> -9 on one process on one machine could put the entire system in a bad
> state.  Is it common to have to address problems like this with manual
> intervention in production systems?  Ideally, I'd hope to be able to set up
> a system where a single node dying a horrible death would never require
> intervention.
>
> On Tue, Jul 19, 2016 at 8:54 AM Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> First of all, killing with -9 is A Very Bad Idea. You can
>> leave write lock files laying around. You can leave
>> the state in an "interesting" place. You haven't given
>> Solr a chance to tell Zookeeper that it's going away.
>> (which would set the state to "down"). In short
>> when you do this you have to deal with the consequences
>> yourself, one of which is this mismatch between
>> cluster state and live_nodes.
>>
>> Now, that rant done the bin/solr script tries to stop Solr
>> gracefully but issues a kill if solr doesn't stop nicely. Personally
>> I think that timeout should be longer, but that's another story.
>>
>> The onlyIfDown='true' option is there specifically as a
>> safety valve. It was provided for those who want to guard against
>> typos and the like, so just don't specify it and you should be fine.
>>
>> Best,
>> Erick
>>
>> On Mon, Jul 18, 2016 at 11:51 PM, Jerome Yang <jey...@pivotal.io> wrote:
>> > Hi all,
>> >
>> > Here's the situation.
>> > I'm using solr5.3 in cloud mode.
>> >
>> > I have 4 nodes.
>> >
>> > After use "kill -9 pid-solr-node" to kill 2 nodes.
>> > These replicas in the two nodes still are "ACTIVE" in zookeeper's
>> > state.json.
>> >
>> > The problem is, when I try to delete these down replicas with
>> > parameter onlyIfDown='true'.
>> > It says,
>> > "Delete replica failed: Attempted to remove replica :
>> > demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is
>> > 'active'."
>> >
>> > From this link:
>> > http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE
>> > It says:
>> > *NOTE*: when the node the replica is hosted on crashes, the replica's state
>> > may remain ACTIVE in ZK. To determine if the replica is truly active, you
>> > must also verify that its node
>> > (http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.html#getNodeName--)
>> > is under /live_nodes in ZK (or use ClusterState.liveNodesContain(String),
>> > http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/ClusterState.html#liveNodesContain-java.lang.String-).
>> >
>> > So, is this a bug?
>> >
>> > Regards,
>> > Jerome
>>
