As Eric mentions, his change to add a state where a replica keeps indexing but doesn't serve queries would surely help in this case.
But these are still boolean decisions of send vs. don't send. In general, it would be nice to abstract the routing policy so that it is pluggable. You could then do things like a "least pending" policy for choosing replicas -- instead of choosing a replica at random, you maintain a pending response count per replica and always send to the one with the fewest pending (or pick randomly among the replicas that tie). A rough sketch of such a policy is below, after the quoted mail.

Also, the chance your distrib=false case gets hit is actually about 1/5, not 1/10: because you have two shards, each distributed query gets two chances at hitting the bad apple, i.e. 1 - (9/10)^2 = 0.19. This was one of the reasons we added replica and host affinity in SOLR-6730. Under enough load, the load distribution will be more or less the same with that change, but the chances of hitting bad apples will be lower.

On 21 Nov 2014 18:56, "Timothy Potter" <thelabd...@gmail.com> wrote:

Just soliciting some advice from the community ...

Let's say I have a 10-node SolrCloud cluster and a single collection with 2 shards and replication factor 10, so each shard has one replica on every node. Now imagine one of those nodes gets into a bad state and becomes slow at serving queries (not bad enough to crash outright, though) ... I'm sure we could ponder any number of ways a box might slow down without crashing.

From my calculations, about 2/10ths of the queries will now be affected, since 1/10 of queries from client apps will hit the bad apple directly, plus 1/10 of the distrib=false sub-requests from other replicas will hit the bad apple.

If QPS is high enough and the bad apple is slow enough, things can start to get out of control pretty fast, especially since we've set max threads so high to avoid distributed deadlock.

What have others done to mitigate this risk? Anything we can do in Solr to help deal with this? It seems reasonable that nodes could identify a bad apple by keeping track of query times and looking for nodes that are significantly outside (>= 2 stddev) what the other nodes are doing. Then maybe mark that node as down in ZooKeeper so clients and other nodes stop trying to send requests to it; or maybe a simple policy of just not sending requests to that node for a few minutes.
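To make the "least pending" idea concrete, here is a rough sketch of what a pluggable policy could look like. The ReplicaSelectionPolicy interface, the LeastPendingPolicy class, and the plain replica-name strings are all hypothetical -- this is not existing Solr API, just an illustration of the bookkeeping involved:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical pluggable policy interface -- not an existing Solr API.
interface ReplicaSelectionPolicy {
  String choose(List<String> replicas);     // pick a replica to query
  void onRequestSent(String replica);       // called when a request goes out
  void onResponseReceived(String replica);  // called when the response comes back
}

// "Least pending" policy: track in-flight requests per replica and always
// pick the replica with the fewest, breaking ties randomly.
class LeastPendingPolicy implements ReplicaSelectionPolicy {
  private final Map<String, AtomicInteger> pending = new ConcurrentHashMap<>();
  private final Random random = new Random();

  public String choose(List<String> replicas) {
    int min = Integer.MAX_VALUE;
    List<String> best = new ArrayList<>();
    for (String r : replicas) {
      int count = pending.computeIfAbsent(r, k -> new AtomicInteger()).get();
      if (count < min) {
        min = count;
        best.clear();
        best.add(r);
      } else if (count == min) {
        best.add(r);
      }
    }
    // Tie: choose randomly among the equally loaded replicas.
    return best.get(random.nextInt(best.size()));
  }

  public void onRequestSent(String replica) {
    pending.computeIfAbsent(replica, k -> new AtomicInteger()).incrementAndGet();
  }

  public void onResponseReceived(String replica) {
    pending.computeIfAbsent(replica, k -> new AtomicInteger()).decrementAndGet();
  }
}

Wired in wherever the load balancer picks the next server today, something like this would tend to route around a slow node on its own, because its pending count stays high while it is slow, without needing a hard up/down decision.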
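And for the bad-apple detection idea in Tim's mail (flagging nodes whose query times sit >= 2 stddev outside the rest), a minimal sketch of the stats-keeping, assuming a hypothetical QueryTimeTracker fed with per-node latencies -- none of these names exist in Solr today:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker: record per-node query latencies and flag nodes whose
// mean latency is at least two standard deviations above the cluster mean.
class QueryTimeTracker {
  // node name -> { sample count, sum of millis, sum of squared millis }
  private final Map<String, double[]> stats = new ConcurrentHashMap<>();

  public void record(String node, double millis) {
    stats.compute(node, (k, s) -> {
      if (s == null) s = new double[3];
      s[0]++; s[1] += millis; s[2] += millis * millis;
      return s;
    });
  }

  // True if this node's mean latency is >= 2 stddev above the cluster-wide mean.
  public boolean isBadApple(String node) {
    double[] s = stats.get(node);
    if (s == null || s[0] == 0) return false;
    double nodeMean = s[1] / s[0];

    double n = 0, sum = 0, sumSq = 0;
    for (double[] v : stats.values()) {
      n += v[0]; sum += v[1]; sumSq += v[2];
    }
    if (n < 2) return false;
    double clusterMean = sum / n;
    double stddev = Math.sqrt(Math.max(0, sumSq / n - clusterMean * clusterMean));
    return nodeMean >= clusterMean + 2 * stddev;
  }
}

Whether the outcome of isBadApple is "mark the node down in ZooKeeper" or just "skip it for a few minutes" is exactly the kind of decision a pluggable routing policy could own.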