Yes, it's the intended behavior. The whole point of the onlyIfDown flag is to act as a safety valve for those who want to be cautious and guard against typos and the like.
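For reference, here's a rough, untested sketch of the DELETEREPLICA call being discussed, done as a plain Java HTTP request against the Collections API. The host/port and the class name are placeholders, and the collection/shard/replica names are just the ones from your error message:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class DeleteDownReplica {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; any live node in the cluster will do.
        String base = "http://localhost:8983/solr/admin/collections";

        // Collection/shard/replica names taken from the error message below.
        String url = base + "?action=DELETEREPLICA"
                + "&collection=" + URLEncoder.encode("demo.public.tbl", "UTF-8")
                + "&shard=shard0"
                + "&replica=core_node4"
                // Drop this parameter (or set it to false) when the host is
                // gone and ZK still reports the replica as 'active'.
                + "&onlyIfDown=false";

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);  // raw response from the Collections API
            }
        }
    }
}

The same parameters work from curl or a browser; the only real decision is whether to pass onlyIfDown at all.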
If you specify onlyIfDown=false and the node still isn't removed from
ZK, that's not right.

Best,
Erick

On Tue, Jul 19, 2016 at 10:41 PM, Jerome Yang <jey...@pivotal.io> wrote:
> What I'm doing is simulating a host-crash situation.
>
> Consider this: a host is not connected to the cluster.
>
> So, if a host crashed, I cannot delete the down replicas by using
> onlyIfDown='true'.
> But in the Solr admin UI, these replicas show as down.
> And without "onlyIfDown", it still shows a failure:
> Delete replica failed: Attempted to remove replica :
> demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is
> 'active'.
>
> Is this the right behavior? If a host is gone, I cannot delete the
> replicas on that host?
>
> Regards,
> Jerome
>
> On Wed, Jul 20, 2016 at 1:58 AM, Justin Lee <lee.justi...@gmail.com> wrote:
>
>> Thanks for taking the time for the detailed response. I completely get
>> what you are saying. Makes sense.
>> On Tue, Jul 19, 2016 at 10:56 AM Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>> > Justin:
>> >
>> > Well, "kill -9" just makes it harder. The original question was
>> > whether a replica being "active" was a bug, and it's not when you
>> > kill -9; the Solr node has no chance to tell Zookeeper it's going
>> > away. ZK does modify live_nodes by itself, so wherever a replica's
>> > state is referenced there are checks, as necessary, that the node is
>> > also in live_nodes. And an overwhelming amount of the time this is
>> > OK; Solr recovers just fine.
>> >
>> > As far as the write locks are concerned, those are a Lucene-level
>> > issue, so if you kill Solr at just the wrong time it's possible one
>> > will be left over. The write locks are held for as short a period as
>> > possible by Lucene, but occasionally they can linger if you kill -9.
>> >
>> > When a replica comes up, if there is already a write lock, it doesn't
>> > just take over; it fails to load instead.
>> >
>> > A kill -9 won't bring the cluster down by itself unless several
>> > coincidences line up. Just don't make it a habit. For instance,
>> > consider a kill -9 on two Solrs that happen to contain all of the
>> > replicas for shard1 of collection1, and you _happen_ to kill them
>> > both at just the wrong time so they both leave Lucene write locks for
>> > those replicas. Now no replica will come up for shard1 and the
>> > collection is unusable.
>> >
>> > So the shorter form is that using "kill -9" is a poor practice that
>> > exposes you to some risk. The hard-core Solr guys work extremely hard
>> > to compensate for this kind of thing, but kill -9 is a harsh,
>> > last-resort option and shouldn't be part of your regular process. You
>> > should expect some "interesting" states when you use it, and you
>> > should use the bin/solr script to stop Solr gracefully.
>> >
>> > Best,
>> > Erick
>> >
>> >
>> > On Tue, Jul 19, 2016 at 9:29 AM, Justin Lee <lee.justi...@gmail.com>
>> > wrote:
>> > > Pardon me for hijacking the thread, but I'm curious about something
>> > > you said, Erick. I always thought that the point (in part) of going
>> > > through the pain of using Zookeeper and creating replicas was so
>> > > that the system could seamlessly recover from catastrophic
>> > > failures. Wouldn't an OOM condition have a similar effect (or maybe
>> > > Java is better at cleanup on that kind of error)?
>> > > The reason I ask is that I'm trying to set up a Solr system that is
>> > > highly available, and I'm a little bit surprised that a kill -9 on
>> > > one process on one machine could put the entire system in a bad
>> > > state. Is it common to have to address problems like this with
>> > > manual intervention in production systems? Ideally, I'd hope to be
>> > > able to set up a system where a single node dying a horrible death
>> > > would never require intervention.
>> > >
>> > > On Tue, Jul 19, 2016 at 8:54 AM Erick Erickson
>> > > <erickerick...@gmail.com> wrote:
>> > >
>> > >> First of all, killing with -9 is A Very Bad Idea. You can leave
>> > >> write lock files lying around. You can leave the state in an
>> > >> "interesting" place. You haven't given Solr a chance to tell
>> > >> Zookeeper that it's going away (which would set the state to
>> > >> "down"). In short, when you do this you have to deal with the
>> > >> consequences yourself, one of which is this mismatch between
>> > >> cluster state and live_nodes.
>> > >>
>> > >> Now, that rant done, the bin/solr script tries to stop Solr
>> > >> gracefully but issues a kill if Solr doesn't stop nicely.
>> > >> Personally I think that timeout should be longer, but that's
>> > >> another story.
>> > >>
>> > >> The onlyIfDown='true' option is there specifically as a safety
>> > >> valve. It was provided for those who want to guard against typos
>> > >> and the like, so just don't specify it and you should be fine.
>> > >>
>> > >> Best,
>> > >> Erick
>> > >>
>> > >> On Mon, Jul 18, 2016 at 11:51 PM, Jerome Yang <jey...@pivotal.io>
>> > >> wrote:
>> > >> > Hi all,
>> > >> >
>> > >> > Here's the situation.
>> > >> > I'm using solr5.3 in cloud mode.
>> > >> >
>> > >> > I have 4 nodes.
>> > >> >
>> > >> > After using "kill -9 pid-solr-node" to kill 2 nodes, the replicas
>> > >> > on those two nodes are still "ACTIVE" in zookeeper's state.json.
>> > >> >
>> > >> > The problem is, when I try to delete these down replicas with the
>> > >> > parameter onlyIfDown='true', it says:
>> > >> > "Delete replica failed: Attempted to remove replica :
>> > >> > demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but
>> > >> > state is 'active'."
>> > >> >
>> > >> > From this link:
>> > >> > http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE
>> > >> >
>> > >> > It says:
>> > >> > *NOTE*: when the node the replica is hosted on crashes, the
>> > >> > replica's state may remain ACTIVE in ZK.
>> > >> > To determine if the replica is truly active, you must also
>> > >> > verify that its node
>> > >> > (http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.html#getNodeName--)
>> > >> > is under /live_nodes in ZK, or use
>> > >> > ClusterState.liveNodesContain(String)
>> > >> > (http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/ClusterState.html#liveNodesContain-java.lang.String-).
>> > >> >
>> > >> > So, is this a bug?
>> > >> >
>> > >> > Regards,
>> > >> > Jerome
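P.S. For completeness, the check that javadoc NOTE describes looks roughly like this in SolrJ. This is a sketch only, assuming SolrJ 5.x where CloudSolrClient takes a ZK connect string; the zkHost value and output formatting are made up, but Replica.getState(), Replica.getNodeName(), and ClusterState.liveNodesContain() are the calls the javadoc refers to:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class TrulyActiveCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder ZK connect string; use the cluster's real zkHost.
        try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
            client.connect();
            ClusterState cs = client.getZkStateReader().getClusterState();
            DocCollection coll = cs.getCollection("demo.public.tbl");

            for (Slice slice : coll.getSlices()) {
                for (Replica r : slice.getReplicas()) {
                    // state.json alone is not enough after a crash: the replica
                    // is only truly active if its node is still in /live_nodes.
                    boolean activeInState = r.getState() == Replica.State.ACTIVE;
                    boolean nodeLive = cs.liveNodesContain(r.getNodeName());
                    System.out.println(slice.getName() + "/" + r.getName()
                            + " state=" + r.getState()
                            + " trulyActive=" + (activeInState && nodeLive));
                }
            }
        }
    }
}

A replica should only be treated as live when both conditions hold; after a kill -9 the first one can stay true in state.json while the node drops out of /live_nodes.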