"Last Gasp" is the last message that Sun Storage controllers would send to each other when things whet sideways... For what it's worth.
> Date: Fri, 21 Nov 2014 14:07:12 -0500
> From: michael.della.bi...@appinions.com
> To: solr-user@lucene.apache.org
> Subject: Re: Dealing with bad apples in a SolrCloud cluster
>
> Good discussion topic.
>
> I'm wondering if Solr doesn't need some sort of "shoot the other node in
> the head" functionality.
>
> We recently ran into one of those failure modes that only AWS can dream up,
> where for an extended amount of time, two nodes in the same placement
> group couldn't talk to one another, but they could both see ZooKeeper,
> so nothing was marked as down.
>
> I've written a basic monitoring script that periodically tries to access
> every node in the cluster from every other node, but I haven't gotten to
> the point of automating anything based on it. It does trigger
> now and again for brief moments of time.
>
> It'd be nice if there were some way the cluster could reach
> consensus that a particular node is a bad apple and evict it from
> collections that have other active replicas. I'm not sure what the logic
> would be that would allow it to rejoin those collections after the
> situation passed, however.
>
> Michael
>
> On 11/21/14 13:54, Timothy Potter wrote:
> > Just soliciting some advice from the community ...
> >
> > Let's say I have a 10-node SolrCloud cluster and a single collection
> > with 2 shards and a replication factor of 10, so each shard has one
> > replica on each of my nodes.
> >
> > Now imagine one of those nodes gets into a bad state and starts
> > to be slow about serving queries (not bad enough to crash outright,
> > though) ... I'm sure we could ponder any number of ways a box might
> > slow down without crashing.
> >
> > From my calculations, about 2/10ths of the queries will now be
> > affected, since:
> >
> > 1/10 queries from client apps will hit the bad apple
> > +
> > 1/10 queries from other replicas will hit the bad apple (distrib=false)
> >
> > If QPS is high enough and the bad apple is slow enough, things can
> > start to get out of control pretty fast, especially since we've set max
> > threads so high to avoid distributed deadlock.
> >
> > What have others done to mitigate this risk? Is there anything we can
> > do in Solr to help deal with this? It seems reasonable that nodes could
> > identify a bad apple by keeping track of query times and looking for
> > nodes that are significantly outside (>= 2 stddev) what the other nodes
> > are doing. Then maybe mark the node as down in ZooKeeper so clients and
> > other nodes stop trying to send requests to it; or maybe adopt a simple
> > policy of not sending requests to that node for a few minutes.
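Michael's cross-node check is worth sketching, since it catches exactly the split where peers can't reach a node but ZooKeeper still can. Here's a minimal Python sketch of the idea: run it on every node (via cron or similar) so each node probes every other node over HTTP. The node list is hypothetical (a real script would read live_nodes from ZooKeeper), and the probe URL is just one cheap node-level request; the right endpoint depends on your Solr version.

```python
#!/usr/bin/env python3
"""Cross-node reachability check: run on every node so each node probes
every other node over HTTP. A node that can still see ZooKeeper but is
unreachable from its peers shows up as failures reported by those peers."""

import socket
from urllib.request import urlopen
from urllib.error import URLError

# Hypothetical node list; a real script would read live_nodes from ZooKeeper.
NODES = ["solr1:8983", "solr2:8983", "solr3:8983"]

TIMEOUT_SECS = 5
ME = socket.gethostname()

def reachable(node):
    """True if the node answers a cheap node-level request in time."""
    url = "http://%s/solr/admin/info/system?wt=json" % node
    try:
        urlopen(url, timeout=TIMEOUT_SECS)
        return True
    except (URLError, socket.timeout, OSError):
        return False

if __name__ == "__main__":
    for node in NODES:
        if not node.startswith(ME) and not reachable(node):
            # A real setup would feed an alerting system; automating
            # eviction based on this is the hard part Michael mentions.
            print("%s cannot reach %s" % (ME, node))
```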
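Tim's 2-stddev idea is also easy to prototype outside Solr: collect a recent mean query latency per node (from logs, JMX, or whatever metrics you already gather) and flag any node sitting two or more standard deviations above the fleet. A rough sketch with made-up numbers:

```python
"""Flag 'bad apple' nodes whose mean query latency sits >= 2 standard
deviations above the rest of the fleet, per Tim's suggestion. How the
per-node latencies get collected is left to the reader."""

from statistics import mean, stdev

def bad_apples(latency_ms_by_node, threshold_stddevs=2.0):
    """Return nodes whose mean latency is an outlier on the high side."""
    values = list(latency_ms_by_node.values())
    if len(values) < 3:
        return []  # not enough peers to call anything an outlier
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [node for node, ms in latency_ms_by_node.items()
            if (ms - mu) / sigma >= threshold_stddevs]

# Hypothetical snapshot of recent mean query times, in milliseconds.
sample = {"solr1": 42, "solr2": 38, "solr3": 45, "solr4": 40,
          "solr5": 41, "solr6": 39, "solr7": 44, "solr8": 43,
          "solr9": 40, "solr10": 400}  # solr10 is struggling

print(bad_apples(sample))  # -> ['solr10']
```

One caveat with this naive version: a single extreme outlier inflates the standard deviation, so if two or three nodes go bad at once they can mask each other. A more robust detector would use the median and MAD instead of mean and stddev. And whatever flags the node, the "rejoin after the situation passes" half of the problem that Michael raises is still the open question.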