I’m running a 4x4 cluster (4 shards, replication factor 4) on 16 hosts. I shut down Solr on one host because it got into some kind of bad, can’t-recover state where it was causing timeouts across the whole cluster (bug #1).
I ran a load benchmark near the capacity of the cluster. It had run fine in test; this was the prod cluster. Solr Cloud added a replica to replace the down node. The host with two cores got double the traffic and started slowly flapping in and out of service; the 95th percentile response time spiked from 3 seconds to 100 seconds. At some point, yet another replica was created, putting two replicas of the same shard on the same instance. Naturally, that instance was overloaded, and I killed the benchmark out of charity.

Bug #2 is creating a new replica when a host is down. This should be an option, defaulting to “false”, because it causes the cascade.

Bug #3 is sending equal traffic to each core without considering the host. Each host should get equal traffic, not each core.

Bug #4 is putting two replicas of the same shard on one instance. That is just asking for trouble.

When it works, this cluster is awesome.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)