> By setting onlyIfDown=false, it did remove the replica, but it still returned a failure message.
> That confuses me.
yeah, there's a lot going on here that should definitely be fixed/improved -- having just tried to walk a mile in your shoes, I realized it's even more confusing than I first realized skimming your initial question.

I've filed a jira to track how painful this all is for users -- thank you for helping point this out, and for sticking with your question to help raise visibility of how nonsensical all this is in context...

https://issues.apache.org/jira/browse/SOLR-9361

> On Thu, Jul 21, 2016 at 5:47 AM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
> > Maybe the problem here is some confusion/ambiguity about the meaning of "down"?
> >
> > TL;DR: think of "onlyIfDown" as "onlyIfShutDownCleanly".
> >
> > IIUC, the purpose of 'onlyIfDown' is a safety valve so that (by default) the cluster will prevent you from removing a replica that wasn't shut down *cleanly* and is officially in a "down" state -- as recorded in the ClusterState for the collection (either the collection's state.json, or the global clusterstate.json if you have an older Solr instance).
> >
> > When you kill -9 a Solr node, the replicas that were hosted on that node will typically still be listed in the cluster state as "active" -- but the node will *not* be in live_nodes, which is how Solr knows those replicas can't currently be used (and leader recovery happens as needed, etc.).
> >
> > If, however, you shut the node down cleanly (or if -- for whatever reason -- the node is up but the replica's SolrCore is not active), then the cluster state will record that replica as "down".
> >
> > Where things unfortunately get confusing is that the CLUSTERSTATUS API call -- apparently in an attempt to simplify things -- changes the reported status of any replica to "down" if that replica is hosted on a node which is not in live_nodes.
> >
> > I suspect that since the UI uses the CLUSTERSTATUS API to get its state information, it doesn't show much difference between a replica that was shut down cleanly and a replica that is hosted on a node which died abruptly.
> >
> > I suspect that's where your confusion is coming from?
> >
> > Ultimately, what onlyIfDown is trying to do is help ensure that you don't accidentally delete a replica that you didn't mean to. The operating assumption is that the only replicas you will (typically) delete are replicas that you shut down cleanly ... if a replica is down because of a hard crash, then that is an exceptional situation, and presumably you will either: a) try to bring the replica back up; or b) delete the replica using onlyIfDown=false to indicate that you know the replica you are deleting isn't 'down' intentionally, but you want to delete it anyway.
> >
> > On Wed, 20 Jul 2016, Erick Erickson wrote:
> >
> > > Date: Wed, 20 Jul 2016 08:26:32 -0700
> > > From: Erick Erickson <erickerick...@gmail.com>
> > > Reply-To: solr-user@lucene.apache.org
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: Re: Send kill -9 to a node and can not delete down replicas with onlyIfDown.
> > >
> > > Yes, it's the intended behavior. The whole point of the onlyIfDown flag was as a safety valve for those who wanted to be cautious and guard against typos and the like.
> > >
> > > If you specify onlyIfDown=false and the replica still isn't removed from ZK, something's not right.
> > >
> > > Best,
> > > Erick
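For concreteness, the delete call being discussed looks roughly like the sketch below, issued against the Collections API. This is a minimal sketch, not code from the thread: the host and port are assumptions, and the collection/shard/replica names are copied from the error message quoted further down -- substitute your own. Omitting onlyIfDown, or passing onlyIfDown=false (the default), tells Solr to remove the replica even though its recorded state is still "active".

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class DeleteReplicaExample {
        public static void main(String[] args) throws Exception {
            // DELETEREPLICA via the Collections API; onlyIfDown=true would refuse
            // to act unless the recorded state of the replica is "down".
            String url = "http://localhost:8983/solr/admin/collections"
                    + "?action=DELETEREPLICA"
                    + "&collection=demo.public.tbl"
                    + "&shard=shard0"
                    + "&replica=core_node4"
                    + "&onlyIfDown=false";
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // print the API response
                }
            } finally {
                conn.disconnect();
            }
        }
    }

With onlyIfDown=true, the same request fails with the "but state is 'active'" error whenever the node was killed without getting a chance to update its state -- which is exactly the situation described in the rest of this thread.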
> > > On Tue, Jul 19, 2016 at 10:41 PM, Jerome Yang <jey...@pivotal.io> wrote:
> > >
> > > > What I'm doing is simulating a host-crash situation.
> > > >
> > > > Consider this: a host is not connected to the cluster.
> > > >
> > > > So, if a host crashes, I cannot delete the down replicas using onlyIfDown='true'. But in the Solr admin UI, these replicas show as down. And without "onlyIfDown", it still shows a failure:
> > > >
> > > > Delete replica failed: Attempted to remove replica : demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is 'active'.
> > > >
> > > > Is this the right behavior? If a host is gone, I cannot delete the replicas on that host?
> > > >
> > > > Regards,
> > > > Jerome
> > > >
> > > > On Wed, Jul 20, 2016 at 1:58 AM, Justin Lee <lee.justi...@gmail.com> wrote:
> > > >
> > > > > Thanks for taking the time for the detailed response. I completely get what you are saying. Makes sense.
> > > > >
> > > > > On Tue, Jul 19, 2016 at 10:56 AM Erick Erickson <erickerick...@gmail.com> wrote:
> > > > >
> > > > > > Justin:
> > > > > >
> > > > > > Well, "kill -9" just makes it harder. The original question was whether a replica being "active" was a bug, and it's not when you kill -9; the Solr node has no chance to tell Zookeeper it's going away. ZK does modify live_nodes by itself, so wherever a replica's state is referenced there are checks, as necessary, that the node is also in live_nodes. And an overwhelming amount of the time this is OK; Solr recovers just fine.
> > > > > >
> > > > > > As far as the write locks are concerned, those are a Lucene-level issue, so if you kill Solr at just the wrong time it's possible that one will be left over. The write locks are held for as short a period as possible by Lucene, but occasionally they can linger if you kill -9.
> > > > > >
> > > > > > When a replica comes up, if there is a write lock already, it doesn't just take over; it fails to load instead.
> > > > > >
> > > > > > A kill -9 won't bring the cluster down by itself except if there are several coincidences. Just don't make it a habit. For instance, consider if you kill -9 two Solrs that happen to contain all of the replicas for shard1 of collection1, and you _happen_ to kill them both at just the wrong time, and they both leave Lucene write locks for those replicas. Now no replica will come up for shard1 and the collection is unusable.
> > > > > >
> > > > > > So the shorter form is that using "kill -9" is a poor practice that exposes you to some risk. The hard-core Solr guys work extremely hard to compensate for this kind of thing, but kill -9 is a harsh, last-resort option and shouldn't be part of your regular process. You should expect some "interesting" states when you do use it, and you should use the bin/solr script to stop Solr gracefully.
> > > > > >
> > > > > > Best,
> > > > > > Erick
> > > > > >
> > > > > > On Tue, Jul 19, 2016 at 9:29 AM, Justin Lee <lee.justi...@gmail.com> wrote:
> > > > > >
> > > > > > > Pardon me for hijacking the thread, but I'm curious about something you said, Erick.
> > > > > > > I always thought that the point (in part) of going through the pain of using zookeeper and creating replicas was so that the system could seamlessly recover from catastrophic failures. Wouldn't an OOM condition have a similar effect (or maybe java is better at cleanup on that kind of error)? The reason I ask is that I'm trying to set up a solr system that is highly available, and I'm a little bit surprised that a kill -9 on one process on one machine could put the entire system in a bad state. Is it common to have to address problems like this with manual intervention in production systems? Ideally, I'd hope to be able to set up a system where a single node dying a horrible death would never require intervention.
> > > > > > >
> > > > > > > On Tue, Jul 19, 2016 at 8:54 AM Erick Erickson <erickerick...@gmail.com> wrote:
> > > > > > >
> > > > > > > > First of all, killing with -9 is A Very Bad Idea. You can leave write lock files lying around. You can leave the state in an "interesting" place. You haven't given Solr a chance to tell Zookeeper that it's going away (which would set the state to "down"). In short, when you do this you have to deal with the consequences yourself, one of which is this mismatch between cluster state and live_nodes.
> > > > > > > >
> > > > > > > > Now, that rant done: the bin/solr script tries to stop Solr gracefully, but issues a kill if Solr doesn't stop nicely. Personally I think that timeout should be longer, but that's another story.
> > > > > > > >
> > > > > > > > The onlyIfDown='true' option is there specifically as a safety valve. It was provided for those who want to guard against typos and the like, so just don't specify it and you should be fine.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Erick
> > > > > > > >
> > > > > > > > On Mon, Jul 18, 2016 at 11:51 PM, Jerome Yang <jey...@pivotal.io> wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > Here's the situation.
> > > > > > > > > I'm using Solr 5.3 in cloud mode.
> > > > > > > > >
> > > > > > > > > I have 4 nodes.
> > > > > > > > >
> > > > > > > > > After using "kill -9 pid-solr-node" to kill 2 nodes, the replicas on those two nodes are still "ACTIVE" in ZooKeeper's state.json.
> > > > > > > > >
> > > > > > > > > The problem is, when I try to delete these down replicas with the parameter onlyIfDown='true', it says:
> > > > > > > > >
> > > > > > > > > "Delete replica failed: Attempted to remove replica : demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is 'active'."
> > > > > > > > > From this link:
> > > > > > > > > http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE
> > > > > > > > >
> > > > > > > > > It says:
> > > > > > > > > *NOTE*: when the node the replica is hosted on crashes, the replica's state may remain ACTIVE in ZK. To determine if the replica is truly active, you must also verify that its node (http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.html#getNodeName--) is under /live_nodes in ZK (or use ClusterState.liveNodesContain(String), http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/ClusterState.html#liveNodesContain-java.lang.String-).
> > > > > > > > >
> > > > > > > > > So, is this a bug?
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Jerome
> >
> > -Hoss
> > http://www.lucidworks.com/

-Hoss
http://www.lucidworks.com/
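Following the javadoc note quoted above, a "truly active" check has to combine the replica's recorded state with live_nodes. Below is a minimal SolrJ sketch of that check; the ZooKeeper address is made up, the collection name is the one from this thread, and the CloudSolrClient constructor shown is the 5.x-era one, so adjust all of these for your own cluster and SolrJ version:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.DocCollection;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;

    public class TrulyActiveCheck {
        public static void main(String[] args) throws Exception {
            // zkHost and collection name are placeholders -- substitute your own.
            try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr")) {
                client.connect();
                ClusterState clusterState = client.getZkStateReader().getClusterState();
                DocCollection collection = clusterState.getCollection("demo.public.tbl");
                for (Slice slice : collection.getSlices()) {
                    for (Replica replica : slice.getReplicas()) {
                        boolean stateActive = replica.getState() == Replica.State.ACTIVE;
                        // "active" in state.json only counts if the hosting node is still live
                        boolean nodeLive = clusterState.liveNodesContain(replica.getNodeName());
                        System.out.printf("%s/%s/%s state=%s live=%s trulyActive=%s%n",
                                collection.getName(), slice.getName(), replica.getName(),
                                replica.getState(), nodeLive, stateActive && nodeLive);
                    }
                }
            }
        }
    }

A replica that prints state=active but live=false is exactly the case discussed in this thread: kill -9 left the recorded state behind, and only the missing live_nodes entry reveals that the replica is really gone.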