Mohsin:

As the author of the transient cores stuff I can authoritatively state
that it wasn't designed with SolrCloud in mind, so I'd be a little
careful about extending that functionality, even by analogy ;).
Not to say that it's totally incompatible, but....

That said, I may be working on some code soon that allows a node
to be put into a new state where the only thing it is allowed to handle
is update requests; all other requests are sent to other nodes.

Consider the situation where, say, the RAID for a machine needs to
have a disk replaced, but you _don't_ want to re-synch it later, something
like a high query rate combined with low update rates. It'd be nice
to have a way to tell SolrCloud "don't send anything but update requests
to me". It seems that such a mechanism could be extended to handle
the bad apple case. The node wouldn't be an Overseer or shard leader,
and it wouldn't handle queries.

So there would be a "goodness factor".
1> everything's fine, the current "active" state.
2> Do the minimal work to stay in sync, some new state.
3> I'm a bad apple, really. I suspect this is the current "down" state.
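To make the three states concrete, here's a rough sketch of how they might map onto routing decisions. Everything in it is hypothetical: the names NodeState, UPDATES_ONLY, accepts and eligible_for_leadership are made up for illustration, and none of this is an existing Solr or SolrCloud API.

```python
from enum import Enum


class NodeState(Enum):
    """Hypothetical 'goodness factor' states; not an existing Solr API."""
    ACTIVE = 1        # 1> everything's fine
    UPDATES_ONLY = 2  # 2> do the minimal work to stay in sync (made-up name)
    DOWN = 3          # 3> bad apple


def accepts(state, request_type):
    """Return True if a node in `state` should be sent `request_type`.

    request_type is "update" or "query" (illustrative labels only).
    """
    if state is NodeState.ACTIVE:
        return True
    if state is NodeState.UPDATES_ONLY:
        return request_type == "update"
    return False  # DOWN: send nothing


def eligible_for_leadership(state):
    """In this sketch only fully active nodes may be Overseer or shard leader."""
    return state is NodeState.ACTIVE
```

With this, accepts(NodeState.UPDATES_ONLY, "query") is False while accepts(NodeState.UPDATES_ONLY, "update") is True, which is the "don't send anything but update requests to me" behavior.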

So far in my thinking, this is a manual process, probably a new
Collections API call. Or maybe there's a new set of "Node API" calls,
I'm not quite sure yet, that's TBD. But it seems like something we should
throw into the mix for this conversation, and think about adding
some kind of automation to the process.

Erick

On Fri, Nov 21, 2014 at 4:22 PM, Mohsin Beg Beg <mohsin....@oracle.com> wrote:
>
> How about dynamic loading/unloading of some shards (cores) similar to the 
> transient cores feature. Should be ok if the unloaded shard has a replica. If 
> no replica then extending shards.tolerant concept to use some 
> timeout/acceptable-latency value sounds interesting.
>
> -Mohsin
>
> ----- Original Message -----
> From: thelabd...@gmail.com
> To: solr-user@lucene.apache.org
> Sent: Friday, November 21, 2014 10:56:51 AM GMT -08:00 US/Canada Pacific
> Subject: Dealing with bad apples in a SolrCloud cluster
>
> Just soliciting some advice from the community ...
>
> Let's say I have a 10-node SolrCloud cluster and have a single collection
> with 2 shards with replication factor 10, so basically each shard has one
> replica on each of my nodes.
>
> Now imagine one of those nodes starts getting into a bad state and starts
> to be slow about serving queries (not bad enough to crash outright though)
> ... I'm sure we could ponder any number of ways a box might slow down
> without crashing.
>
> From my calculations, about 2/10ths of the queries will now be affected
> since
>
> 1/10 queries from client apps will hit the bad apple
>   +
> 1/10 queries from other replicas will hit the bad apple (distrib=false)
>
>
> If QPS is high enough and the bad apple is slow enough, things can start to
> get out of control pretty fast, esp. since we've set max threads so high to
> avoid distributed dead-lock.
>
> What have others done to mitigate this risk? Anything we can do in Solr to
> help deal with this? It seems reasonable that nodes can identify a bad
> apple by keeping track of query times and looking for nodes that are
> significantly outside (>=2 stddev) what the other nodes are doing. Then
> maybe mark the node as being down in ZooKeeper so clients and other nodes
> stop trying to send requests to it; or maybe a simple policy of just don't
> send requests to that node for a few minutes.
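
The ">=2 stddev" idea in the quoted message could be sketched roughly as below. This is only an illustration under assumptions: the function name find_bad_apples, the input format (node name mapped to average query time in ms) and the choice to compare each node against its peers rather than against the whole cluster are all mine, not anything Solr does today.

```python
from statistics import mean, stdev


def find_bad_apples(avg_query_ms, threshold=2.0):
    """Return nodes whose average query time is >= threshold stddevs above their peers.

    avg_query_ms: dict of node name -> average query time in ms (hypothetical input).
    Each node is compared against the mean/stddev of the *other* nodes, so a
    single slow node does not inflate its own baseline.
    """
    bad = []
    for node, t in avg_query_ms.items():
        others = [v for n, v in avg_query_ms.items() if n != node]
        if len(others) < 2:
            continue  # need at least two peers to compute a stddev
        mu, sigma = mean(others), stdev(others)
        # sigma == 0 means all peers are identical; this sketch skips that case
        if sigma > 0 and (t - mu) / sigma >= threshold:
            bad.append(node)
    return bad
```

A caller would feed in whatever per-node timings it tracks and then decide what to do with the result, e.g. stop routing to those nodes for a few minutes; that policy part is left out here.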
