Good discussion topic.
I'm wondering if Solr doesn't need some sort of "shoot the other node in
the head" functionality.
We recently ran into one of those failure modes that only AWS can dream
up, where for an extended amount of time, two nodes in the same placement
group couldn't talk to one another, but they could both see ZooKeeper,
so nothing was marked as down.
I've written a basic monitoring script that periodically tries to access
every node in the cluster from every other node, but I haven't gotten to
the point of automating anything based on what it reports. It does trigger
now and again for brief periods.
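For what it's worth, the gist of the probe is something like the sketch
below (Java rather than my actual script; it assumes the stock
/solr/admin/info/system endpoint and a hard-coded node list, and in
practice you'd pull the node list from ZooKeeper's live_nodes and run one
copy per node via cron or similar):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;

// Cross-node health probe: from the node this runs on, try to reach every
// other node's admin endpoint and report anything unreachable or unhealthy.
public class NodeProbe {

    // Assumption: node base URLs are hard-coded here for illustration only.
    private static final List<String> NODES = Arrays.asList(
            "http://solr1:8983/solr", "http://solr2:8983/solr");

    public static void main(String[] args) {
        for (String node : NODES) {
            try {
                URL url = new URL(node + "/admin/info/system?wt=json");
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setConnectTimeout(2000); // fail fast on a partitioned peer
                conn.setReadTimeout(5000);
                int status = conn.getResponseCode();
                if (status != 200) {
                    System.err.println("SUSPECT " + node + " returned HTTP " + status);
                }
                conn.disconnect();
            } catch (Exception e) {
                // Connect/read failures are exactly the partial-partition case.
                System.err.println("UNREACHABLE " + node + ": " + e.getMessage());
            }
        }
    }
}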
It'd be nice if there were some way the cluster could reach a consensus
that a particular node is a bad apple and evict it from collections that
have other active replicas. I'm not sure what the logic would be for
letting it rejoin those collections once the situation passes, though.
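Just to make the idea concrete, here's a toy sketch of what that
consensus check could look like (nothing like this exists in Solr today;
the names and data structures are made up for illustration):

import java.util.*;

// Each node reports which peers it could not reach; a peer is declared a
// bad apple only if a strict majority of the *other* nodes agree.
public class BadAppleVote {

    // reports: observer node -> set of nodes that observer could not reach
    static Set<String> badApples(Map<String, Set<String>> reports, Set<String> allNodes) {
        Set<String> flagged = new HashSet<>();
        for (String candidate : allNodes) {
            int voters = 0;
            int votes = 0;
            for (Map.Entry<String, Set<String>> report : reports.entrySet()) {
                if (report.getKey().equals(candidate)) continue; // nodes don't vote on themselves
                voters++;
                if (report.getValue().contains(candidate)) votes++;
            }
            if (voters > 0 && votes * 2 > voters) flagged.add(candidate);
        }
        return flagged;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> reports = new HashMap<>();
        reports.put("solr1", new HashSet<>(Arrays.asList("solr3")));
        reports.put("solr2", new HashSet<>(Arrays.asList("solr3")));
        reports.put("solr3", new HashSet<String>());
        Set<String> all = new HashSet<>(Arrays.asList("solr1", "solr2", "solr3"));
        System.out.println(badApples(reports, all)); // prints [solr3]
    }
}

The strict-majority threshold is arbitrary; the point is just that no
single node's view gets to evict anyone.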
Michael
On 11/21/14 13:54, Timothy Potter wrote:
Just soliciting some advice from the community ...
Let's say I have a 10-node SolrCloud cluster with a single collection
that has 2 shards and a replication factor of 10, so basically each shard
has one replica on each of my nodes.
Now imagine one of those nodes gets into a bad state and becomes slow at
serving queries (not bad enough to crash outright, though) ... I'm sure we
could ponder any number of ways a box might slow down without crashing.
From my calculations, about 2/10ths of the queries will now be affected
since
1/10 queries from client apps will hit the bad apple
+
1/10 queries from other replicas will hit the bad apple (distrib=false)
If QPS is high enough and the bad apple is slow enough, things can start to
get out of control pretty fast, esp. since we've set max threads so high to
avoid distributed deadlock.
What have others done to mitigate this risk? Anything we can do in Solr to
help deal with this? It seems reasonable that nodes could identify a bad
apple by keeping track of query times and looking for nodes that are
significantly outside (>= 2 stddev) what the other nodes are doing. Then
maybe mark the node as down in ZooKeeper so clients and other nodes stop
sending requests to it, or maybe apply a simple policy of just not sending
requests to that node for a few minutes.
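Something like the following back-of-the-envelope check is what I have in
mind for the detection piece (just a sketch, not anything in Solr today;
the per-node averages would have to come from whatever query-time metrics
you already collect):

import java.util.*;

// Flag any node whose recent average query time is >= 2 standard
// deviations above the mean across all nodes. What to do with a flagged
// node (mark it down in ZooKeeper, or just stop routing to it for a few
// minutes) is a separate policy decision.
public class SlowNodeDetector {

    static List<String> slowNodes(Map<String, Double> avgQueryTimeMs) {
        double mean = 0;
        for (double v : avgQueryTimeMs.values()) mean += v;
        mean /= avgQueryTimeMs.size();

        double variance = 0;
        for (double v : avgQueryTimeMs.values()) variance += (v - mean) * (v - mean);
        double stddev = Math.sqrt(variance / avgQueryTimeMs.size());

        List<String> slow = new ArrayList<>();
        if (stddev == 0) return slow; // all nodes look the same, nothing to flag
        for (Map.Entry<String, Double> e : avgQueryTimeMs.entrySet()) {
            if (e.getValue() >= mean + 2 * stddev) slow.add(e.getKey());
        }
        return slow;
    }

    public static void main(String[] args) {
        Map<String, Double> times = new HashMap<>();
        for (int i = 1; i <= 9; i++) times.put("solr" + i, 30.0 + i); // healthy: ~30-40ms
        times.put("solr10", 900.0);                                   // the bad apple
        System.out.println(slowNodes(times)); // prints [solr10]
    }
}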