Re: Dealing with bad apples in a SolrCloud cluster

Mark Miller Fri, 21 Nov 2014 11:08:32 -0800

bq.  esp. since we've set max threads so high to avoid distributed
dead-lock.


We should fix this for 5.0 by adding a second thread pool dedicated to
internal (node-to-node) requests. We can make it optional if necessary (to
keep default container support simple), but I think it's a fairly easy
improvement.
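
To illustrate why a second pool helps with distributed deadlock, here is a
minimal sketch using plain java.util.concurrent. The class name
SplitRequestPools, its methods, and the pool sizes are hypothetical; this is
not Solr's or the servlet container's actual API. A top-level request running
in the external pool waits on a sub-request, which can always find a thread
because it runs in its own pool.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: one pool for client requests, one for internal sub-requests. */
class SplitRequestPools {
    private final ExecutorService externalPool;
    private final ExecutorService internalPool;

    SplitRequestPools(int externalThreads, int internalThreads) {
        externalPool = Executors.newFixedThreadPool(externalThreads);
        internalPool = Executors.newFixedThreadPool(internalThreads);
    }

    /** Routes work by origin so internal requests never wait behind client requests. */
    <T> Future<T> submit(boolean internal, Callable<T> work) {
        return (internal ? internalPool : externalPool).submit(work);
    }

    void shutdown() {
        externalPool.shutdownNow();
        internalPool.shutdownNow();
    }

    /**
     * A top-level request occupies the only external thread while it waits on a
     * sub-request. With a single shared one-thread pool this would never finish;
     * with split pools the sub-request runs on the internal thread.
     */
    static String demo() {
        SplitRequestPools pools = new SplitRequestPools(1, 1);
        try {
            Future<String> top = pools.submit(false,
                () -> "merged:" + pools.submit(true, () -> "shard-result").get());
            return top.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            return "failed: " + e;
        } finally {
            pools.shutdown();
        }
    }
}
```

With bounded, separate pools, max threads on the external side would no longer
need to be set very high just to leave room for internal requests.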

- Mark

On Fri Nov 21 2014 at 1:56:51 PM Timothy Potter <thelabd...@gmail.com>
wrote:

> Just soliciting some advice from the community ...
>
> Let's say I have a 10-node SolrCloud cluster and have a single collection
> with 2 shards with replication factor 10, so basically each shard has one
> replica on each of my nodes.
>
> Now imagine one of those nodes starts getting into a bad state and starts
> to be slow about serving queries (not bad enough to crash outright though)
> ... I'm sure we could ponder any number of ways a box might slow down
> without crashing.
>
> By my calculations, about 2/10 of the queries will now be affected,
> since:
>
> 1/10 of queries from client apps will hit the bad apple
>   +
> 1/10 of queries from other replicas will hit the bad apple (distrib=false)
>
>
> If QPS is high enough and the bad apple is slow enough, things can start to
> get out of control pretty fast, esp. since we've set max threads so high to
> avoid distributed dead-lock.
>
> What have others done to mitigate this risk? Anything we can do in Solr to
> help deal with this? It seems reasonable that nodes can identify a bad
> apple by keeping track of query times and looking for nodes that are
> significantly outside (>=2 stddev) what the other nodes are doing. Then
> maybe mark the node as being down in ZooKeeper so clients and other nodes
> stop trying to send requests to it; or maybe apply a simple policy of just
> not sending requests to that node for a few minutes.
>
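
The detection idea in the quoted message (flag a node whose query times are
>= 2 stddev away from the other nodes, then stop sending it requests for a few
minutes) could be sketched roughly as below. Everything here is an assumption
for illustration: the class SlowNodeDetector, its method names, the use of
per-node mean latency as the input, and comparing each node against the
statistics of the *other* nodes only. It does not touch ZooKeeper or any Solr
API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of bad-apple detection plus a temporary exclusion list. */
class SlowNodeDetector {
    // node name -> timestamp (ms) until which the node should be skipped
    private final Map<String, Long> excludedUntil = new HashMap<>();

    /**
     * Returns nodes whose mean latency is at least k standard deviations above
     * the mean latency of all the other nodes.
     */
    static List<String> findSlowNodes(Map<String, Double> meanLatencyMs, double k) {
        List<String> slow = new ArrayList<>();
        for (Map.Entry<String, Double> candidate : meanLatencyMs.entrySet()) {
            double sum = 0, sumSq = 0;
            int n = 0;
            for (Map.Entry<String, Double> other : meanLatencyMs.entrySet()) {
                if (other.getKey().equals(candidate.getKey())) continue;
                double v = other.getValue();
                sum += v;
                sumSq += v * v;
                n++;
            }
            if (n < 2) continue; // too few peers to estimate a spread
            double mean = sum / n;
            double stddev = Math.sqrt(Math.max(0, sumSq / n - mean * mean));
            double excess = candidate.getValue() - mean;
            // Only nodes slower than their peers count; a real policy would
            // probably also want a minimum absolute gap when stddev is tiny.
            if (excess > 0 && excess >= k * stddev) slow.add(candidate.getKey());
        }
        return slow;
    }

    /** Skip this node for durationMs starting at nowMs. */
    void exclude(String node, long nowMs, long durationMs) {
        excludedUntil.put(node, nowMs + durationMs);
    }

    /** True if the node was never excluded or its exclusion has expired. */
    boolean isUsable(String node, long nowMs) {
        return nowMs >= excludedUntil.getOrDefault(node, 0L);
    }
}
```

Comparing against the other nodes only keeps one very slow node from inflating
the spread it is measured against; the same slow node does inflate the spread
seen by its healthy peers, which makes false positives on them less likely.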
