Justin: Well, "kill -9" just makes it harder. The original question was whether a replica staying "active" is a bug, and it's not: when you kill -9, the Solr node has no chance to tell Zookeeper it's going away. ZK does remove the node from live_nodes on its own, so wherever a replica's state is referenced there are also checks that the node is still in live_nodes. And the overwhelming majority of the time this is OK; Solr recovers just fine.
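If you want to see what that check amounts to from the client side, here's a rough, untested SolrJ sketch (the ZK address and collection name are placeholders for your setup). It just walks the cluster state and compares each replica's recorded state against live_nodes:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class ReplicaLivenessCheck {
  public static void main(String[] args) throws Exception {
    // Placeholders: point these at your own ZK ensemble and collection.
    try (CloudSolrClient client = new CloudSolrClient("localhost:2181")) {
      client.connect();
      ClusterState clusterState = client.getZkStateReader().getClusterState();
      DocCollection coll = clusterState.getCollection("collection1");
      for (Slice shard : coll.getSlices()) {
        for (Replica replica : shard.getReplicas()) {
          boolean saysActive = replica.getState() == Replica.State.ACTIVE;
          boolean nodeIsLive = clusterState.liveNodesContain(replica.getNodeName());
          // Only "active" *and* live means the replica is really usable;
          // "active" on a dead node is just the leftover state.json entry.
          System.out.printf("%s/%s/%s active=%b live=%b%n",
              coll.getName(), shard.getName(), replica.getName(), saysActive, nodeIsLive);
        }
      }
    }
  }
}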
As far as the write locks are concerned, those are a Lucene-level issue, so if you kill Solr at just the wrong time it's possible one will be left over. The write locks are held for as short a period as possible by Lucene, but occasionally they can linger if you kill -9. When a replica comes up and finds an existing write lock, it doesn't just take over; it fails to load instead.

A kill -9 won't bring the cluster down by itself unless several coincidences line up, so just don't make it a habit. For instance, say you kill -9 two Solrs that happen to host all of the replicas for shard1 of collection1, and you _happen_ to kill them both at just the wrong time so they both leave Lucene write locks for those replicas. Now no replica will come up for shard1 and the collection is unusable.

So the shorter form is that using "kill -9" is a poor practice that exposes you to some risk. The hard-core Solr guys work extremely hard to compensate for this kind of thing, but kill -9 is a harsh, last-resort option and shouldn't be part of your regular process. Expect some "interesting" states when you do use it, and use the bin/solr script to stop Solr gracefully instead.

Best,
Erick

On Tue, Jul 19, 2016 at 9:29 AM, Justin Lee <lee.justi...@gmail.com> wrote:
> Pardon me for hijacking the thread, but I'm curious about something you
> said, Erick. I always thought that the point (in part) of going through
> the pain of using zookeeper and creating replicas was so that the system
> could seamlessly recover from catastrophic failures. Wouldn't an OOM
> condition have a similar effect (or maybe Java is better at cleanup on that
> kind of error)? The reason I ask is that I'm trying to set up a Solr
> system that is highly available, and I'm a little bit surprised that a kill
> -9 on one process on one machine could put the entire system in a bad
> state. Is it common to have to address problems like this with manual
> intervention in production systems? Ideally, I'd hope to be able to set up
> a system where a single node dying a horrible death would never require
> intervention.
>
> On Tue, Jul 19, 2016 at 8:54 AM Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> First of all, killing with -9 is A Very Bad Idea. You can
>> leave write lock files lying around. You can leave
>> the state in an "interesting" place. You haven't given
>> Solr a chance to tell Zookeeper that it's going away
>> (which would set the state to "down"). In short,
>> when you do this you have to deal with the consequences
>> yourself, one of which is this mismatch between
>> cluster state and live_nodes.
>>
>> Now, that rant done, the bin/solr script tries to stop Solr
>> gracefully but issues a kill if Solr doesn't stop nicely. Personally
>> I think that timeout should be longer, but that's another story.
>>
>> The onlyIfDown='true' option is there specifically as a
>> safety valve. It was provided for those who want to guard against
>> typos and the like, so just don't specify it and you should be fine.
>>
>> Best,
>> Erick
>>
>> On Mon, Jul 18, 2016 at 11:51 PM, Jerome Yang <jey...@pivotal.io> wrote:
>> > Hi all,
>> >
>> > Here's the situation.
>> > I'm using Solr 5.3 in cloud mode.
>> >
>> > I have 4 nodes.
>> >
>> > After using "kill -9 pid-solr-node" to kill 2 nodes,
>> > the replicas on those two nodes are still "ACTIVE" in Zookeeper's
>> > state.json.
>> >
>> > The problem is, when I try to delete these down replicas with
>> > the parameter onlyIfDown='true',
>> > it says:
>> > "Delete replica failed: Attempted to remove replica :
>> > demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is
>> > 'active'."
>> >
>> > From this link:
>> > http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE
>> >
>> > It says:
>> > *NOTE*: when the node the replica is hosted on crashes, the replica's state
>> > may remain ACTIVE in ZK. To determine if the replica is truly active, you
>> > must also verify that its node
>> > (http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.html#getNodeName--)
>> > is under /live_nodes in ZK (or use ClusterState.liveNodesContain(String),
>> > http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/ClusterState.html#liveNodesContain-java.lang.String-).
>> >
>> > So, is this a bug?
>> >
>> > Regards,
>> > Jerome
>>
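For anyone hitting the same thing: the delete Erick suggests (omitting onlyIfDown) can be issued from SolrJ roughly as below. This is an untested sketch; the ZK address is a placeholder, and the collection/shard/replica names are simply the ones from the error message above.

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class DeleteDownReplica {
  public static void main(String[] args) throws Exception {
    // Placeholder ZK address; adjust for your cluster.
    try (CloudSolrClient client = new CloudSolrClient("localhost:2181")) {
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("action", "DELETEREPLICA");
      params.set("collection", "demo.public.tbl");
      params.set("shard", "shard0");
      params.set("replica", "core_node4");
      // onlyIfDown is deliberately omitted, per Erick's advice, because the
      // stale state.json still reports the replica as "active".
      QueryRequest request = new QueryRequest(params);
      request.setPath("/admin/collections");
      client.request(request);
    }
  }
}

The plain-HTTP equivalent is a request to
/admin/collections?action=DELETEREPLICA&collection=demo.public.tbl&shard=shard0&replica=core_node4
on any live node.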