"Last Gasp" is the last message that Sun Storage controllers would send to each other when things whet sideways... For what it's worth.
> Date: Fri, 21 Nov 2014 14:07:12 -0500
> From: michael.della.bi...@appinions.com
> To: solr-user@lucene.apache.org
> Subject: Re: Dealing with bad apples in a SolrCloud cluster
>
> Good discussion topic.
>
> I'm wondering if Solr doesn't need some sort of "shoot the other node in
> the head" functionality.
>
> We recently ran into one of those failure modes that only AWS can dream up,
> where for an extended amount of time, two nodes in the same placement
> group couldn't talk to one another, but they could both see ZooKeeper,
> so nothing was marked as down.
>
> I've written a basic monitoring script that periodically tries to access
> every node in the cluster from every other node, but I haven't gotten to
> the point of automating anything based on it. It does trigger
> now and again for brief moments of time.
>
> It'd be nice if there were some way the cluster could reach
> consensus that a particular node is a bad apple and evict it from
> collections that have other active replicas. I'm not sure what the logic
> would be that would allow it to rejoin those collections after the
> situation passed, however.
>
> Michael
>
> On 11/21/14 13:54, Timothy Potter wrote:
> > Just soliciting some advice from the community ...
> >
> > Let's say I have a 10-node SolrCloud cluster and a single collection
> > with 2 shards and a replication factor of 10, so each shard has one
> > replica on each of my nodes.
> >
> > Now imagine one of those nodes gets into a bad state and starts
> > to be slow about serving queries (not bad enough to crash outright,
> > though) ... I'm sure we could ponder any number of ways a box might
> > slow down without crashing.
> >
> > From my calculations, about 2/10ths of the queries will now be
> > affected, since:
> >
> > 1/10 queries from client apps will hit the bad apple
> > +
> > 1/10 queries from other replicas will hit the bad apple (distrib=false)
> >
> > If QPS is high enough and the bad apple is slow enough, things can
> > start to get out of control pretty fast, especially since we've set max
> > threads so high to avoid distributed deadlock.
> >
> > What have others done to mitigate this risk? Is there anything we can
> > do in Solr to help deal with this? It seems reasonable that nodes could
> > identify a bad apple by keeping track of query times and looking for
> > nodes that are significantly outside (>= 2 stddev) what the other nodes
> > are doing. Then maybe mark the node as down in ZooKeeper so clients and
> > other nodes stop trying to send requests to it; or maybe adopt a simple
> > policy of not sending requests to that node for a few minutes.
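Michael's cross-node check is worth sketching, since it catches exactly the split where peers can't reach a node but ZooKeeper still can. Here's a minimal Python sketch of the idea: run it on every node (via cron or similar) so each node probes every other node over HTTP. The node list is hypothetical (a real script would read live_nodes from ZooKeeper), and the probe URL is just one cheap node-level request; the right endpoint depends on your Solr version.

```python
#!/usr/bin/env python3
"""Cross-node reachability check: run on every node so each node probes
every other node over HTTP. A node that can still see ZooKeeper but is
unreachable from its peers shows up as failures reported by those peers."""

import socket
from urllib.request import urlopen
from urllib.error import URLError

# Hypothetical node list; a real script would read live_nodes from ZooKeeper.
NODES = ["solr1:8983", "solr2:8983", "solr3:8983"]

TIMEOUT_SECS = 5
ME = socket.gethostname()

def reachable(node):
    """True if the node answers a cheap node-level request in time."""
    url = "http://%s/solr/admin/info/system?wt=json" % node
    try:
        urlopen(url, timeout=TIMEOUT_SECS)
        return True
    except (URLError, socket.timeout, OSError):
        return False

if __name__ == "__main__":
    for node in NODES:
        if not node.startswith(ME) and not reachable(node):
            # A real setup would feed an alerting system; automating
            # eviction based on this is the hard part Michael mentions.
            print("%s cannot reach %s" % (ME, node))
```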
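Tim's 2-stddev idea is also easy to prototype outside Solr: collect a recent mean query latency per node (from logs, JMX, or whatever metrics you already gather) and flag any node sitting two or more standard deviations above the fleet. A rough sketch with made-up numbers:

```python
"""Flag 'bad apple' nodes whose mean query latency sits >= 2 standard
deviations above the rest of the fleet, per Tim's suggestion. How the
per-node latencies get collected is left to the reader."""

from statistics import mean, stdev

def bad_apples(latency_ms_by_node, threshold_stddevs=2.0):
    """Return nodes whose mean latency is an outlier on the high side."""
    values = list(latency_ms_by_node.values())
    if len(values) < 3:
        return []  # not enough peers to call anything an outlier
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [node for node, ms in latency_ms_by_node.items()
            if (ms - mu) / sigma >= threshold_stddevs]

# Hypothetical snapshot of recent mean query times, in milliseconds.
sample = {"solr1": 42, "solr2": 38, "solr3": 45, "solr4": 40,
          "solr5": 41, "solr6": 39, "solr7": 44, "solr8": 43,
          "solr9": 40, "solr10": 400}  # solr10 is struggling

print(bad_apples(sample))  # -> ['solr10']
```

One caveat with this naive version: a single extreme outlier inflates the standard deviation, so if two or three nodes go bad at once they can mask each other. A more robust detector would use the median and MAD instead of mean and stddev. And whatever flags the node, the "rejoin after the situation passes" half of the problem that Michael raises is still the open question.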