I’m running a 4x4 cluster (4 shards, replication factor 4) on 16 hosts. I shut down Solr on one host because it got into some kind of bad, can’t-recover state where it was causing timeouts across the whole cluster (bug #1).
I ran a load benchmark near the capacity of the cluster. It had run fine in test; this was the prod cluster. Solr Cloud added a replica to replace the down node. The host with two cores got double the traffic and started slowly flapping in and out of service; the 95th percentile response time spiked from 3 seconds to 100 seconds. At some point, yet another replica was created, putting two replicas of the same shard on the same instance. Naturally, that instance was overloaded, and I killed the benchmark out of charity.

Bug #2 is creating a new replica when a host is down. This should be an option, defaulting to “false”, because it causes the cascade.

Bug #3 is sending equal traffic to each core without considering the host. Each host should get equal traffic, not each core.

Bug #4 is putting two replicas of the same shard on one instance. That is just asking for trouble.

When it works, this cluster is awesome.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)