bq. esp. since we've set max threads so high to avoid distributed dead-lock.
We should fix this for 5.0 - add a second thread pool that is used for internal requests. We can make it optional if necessary (simpler default container support), but it's a fairly easy improvement I think.

- Mark

On Fri Nov 21 2014 at 1:56:51 PM Timothy Potter <thelabd...@gmail.com> wrote:

> Just soliciting some advice from the community ...
>
> Let's say I have a 10-node SolrCloud cluster and have a single collection
> with 2 shards with replication factor 10, so basically each shard has one
> replica on each of my nodes.
>
> Now imagine one of those nodes starts getting into a bad state and starts
> to be slow about serving queries (not bad enough to crash outright though)
> ... I'm sure we could ponder any number of ways a box might slow down
> without crashing.
>
> From my calculations, about 2/10ths of the queries will now be affected
> since
>
> 1/10 queries from client apps will hit the bad apple
> +
> 1/10 queries from other replicas will hit the bad apple (distrib=false)
>
>
> If QPS is high enough and the bad apple is slow enough, things can start to
> get out of control pretty fast, esp. since we've set max threads so high to
> avoid distributed dead-lock.
>
> What have others done to mitigate this risk? Anything we can do in Solr to
> help deal with this? It seems reasonable that nodes can identify a bad
> apple by keeping track of query times and looking for nodes that are
> significantly outside (>=2 stddev) what the other nodes are doing. Then
> maybe mark the node as being down in ZooKeeper so clients and other nodes
> stop trying to send requests to it; or maybe a simple policy of just don't
> send requests to that node for a few minutes.
>
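For illustration only, the ">=2 stddev" check from the quoted message could be sketched roughly as below. This is not Solr code and not an existing Solr API: it is plain Python, and the names `find_bad_apples`, `latency_ms`, `threshold` and `min_spread_ms` are hypothetical. It assumes each node already has a mean query time per peer, and it reads "what the other nodes are doing" as comparing each node against the mean and standard deviation of all the other nodes (leaving the candidate out), which is one possible interpretation.

```python
from statistics import mean, pstdev


def find_bad_apples(latency_ms, threshold=2.0, min_spread_ms=1.0):
    """Return the nodes whose mean query latency is at least `threshold`
    standard deviations above the mean latency of the other nodes.

    latency_ms: dict mapping a node name to its observed mean latency (ms).
    min_spread_ms: floor for the standard deviation (an assumption), so that
    a cluster with near-identical latencies does not flag tiny differences.
    """
    bad = []
    for node, latency in latency_ms.items():
        # Compare each node only against its peers (leave-one-out).
        others = [v for n, v in latency_ms.items() if n != node]
        if len(others) < 2:
            continue  # not enough peers to compare against
        baseline = mean(others)
        spread = max(pstdev(others), min_spread_ms)
        if latency >= baseline + threshold * spread:
            bad.append(node)
    return sorted(bad)


# Nine healthy nodes at 21..29 ms and one slow node at 400 ms.
cluster = {f"node{i}": 20.0 + i for i in range(1, 10)}
cluster["node10"] = 400.0
print(find_bad_apples(cluster))  # prints ['node10']
```

Leaving the candidate out of its own baseline keeps one very slow node from inflating the standard deviation it is measured against. What to do with a flagged node (mark it down in ZooKeeper, or skip it for a few minutes) is a separate policy decision that this sketch does not cover.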