6.3.0. No idea how it happened, but I ended up with two replicas on the same host after one host went down.
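For reference, here is a rough sketch of the CREATE call I think would pin each shard's replicas to separate nodes, using the rule-based replica placement described in the 6.x reference guide, alongside the settings we already use (maxShardsPerNode=1, autoAddReplicas=false). The Solr URL, collection name, and config name below are placeholders, and I have not verified the rule syntax against 6.3.0, so treat it as a sketch rather than a recipe:

# Hypothetical sketch: create a 4x4 collection with a placement rule that
# refuses to put two replicas of the same shard on one node. The base URL,
# collection name, and config name are made up; the rule syntax follows the
# "Rule-based Replica Placement" page of the Solr 6.x Reference Guide.
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR = "http://localhost:8983/solr"    # any node in the cluster (assumption)

params = {
    "action": "CREATE",
    "name": "search_en",                # hypothetical collection name
    "collection.configName": "search_conf",
    "numShards": 4,
    "replicationFactor": 4,
    "maxShardsPerNode": 1,              # same setting as in this thread
    "autoAddReplicas": "false",         # same setting as in this thread
    # For any shard, allow fewer than 2 replicas per node (i.e. at most one).
    # With one JVM per EC2 instance, node and host are effectively the same.
    "rule": "shard:*,replica:<2,node:*",
    "wt": "json",
}

with urlopen(SOLR + "/admin/collections?" + urlencode(params)) as resp:
    print(resp.read().decode("utf-8"))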
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Mar 18, 2017, at 8:35 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> Hmmm, I'm totally mystified about how Solr is "creating a new replica when one host is down". Are you saying this is happening automagically? You're right, the autoAddReplicas bit is HDFS-only, so having replicas just show up is completely weird. In days past, when a replica was discovered on disk when Solr started up, it would reconstruct itself _even if the collection had been deleted_. They would reappear in clusterstate.json, _not_ the individual state.json files under collections in ZK. Is this at all possible?
>
> What version of Solr are you using anyway? The reconstruction I mentioned above is 4x, IIRC.
>
> Best,
> Erick
>
> On Sat, Mar 18, 2017 at 5:45 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>> Thanks. This is a very CPU-heavy workload, with ngram fields and very long queries. 16.7 million docs.
>>
>> The whole cascading-failure thing in search engines is hard. The first time I hit this was at Infoseek, over twenty years ago.
>>
>>> On Mar 18, 2017, at 12:46 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>> bug #2: Solr shouldn't be adding replicas by itself unless you specified autoAddReplicas=true when you created the collection. It defaults to "false". So I'm not sure what's going on here.
>>
>> "autoAddReplicas":"false",
>>
>> in both collections. I thought that only worked with HDFS anyway.
>>
>>> bug #3: The internal load balancers are round-robin, so this is expected. Not optimal, I'll grant, but expected.
>>
>> Right. Still a bug. It should be round-robin on instances, not cores.
>>
>>> bug #4: What shard placement rules are you using? There are a series of rules for replica placement, and one of the criteria (IIRC) is exactly to try to distribute replicas to different hosts. Although there was some glitchiness about whether two JVMs on the same _host_ were considered "the same host" or not.
>>
>> Separate Amazon EC2 instances, one JVM per instance, no rules other than the default.
>>
>> "maxShardsPerNode":"1",
>>
>>> bug #1 has been more or less of a pain for quite a while; work is ongoing there.
>>
>> Glad to share our logs.
>>
>> wunder
>>
>>> FWIW,
>>> Erick
>>>
>>> On Fri, Mar 17, 2017 at 5:40 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>>> I'm running a 4x4 cluster (4 shards, replication factor 4) on 16 hosts. I shut down Solr on one host because it got into some kind of bad, can't-recover state where it was causing timeouts across the whole cluster (bug #1).
>>>>
>>>> I ran a load benchmark near the capacity of the cluster. This had run fine in test; this was the prod cluster.
>>>>
>>>> SolrCloud added a replica to replace the down node. The node with two cores got double the traffic and started slowly flapping in and out of service. The 95th percentile response time spiked from 3 seconds to 100 seconds. At some point, another replica was made, with two replicas from the same shard on the same instance. Naturally, that was overloaded, and I killed the benchmark out of charity.
>>>>
>>>> Bug #2 is creating a new replica when one host is down. This should be an option and default to "false", because it causes the cascade.
>>>>
>>>> Bug #3 is sending equal traffic to each core without considering the host. Each host should get equal traffic, not each core.
>>>>
>>>> Bug #4 is putting two replicas from the same shard on one instance. That is just asking for trouble.
>>>>
>>>> When it works, this cluster is awesome.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
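P.S. Here is the small check I am using to spot co-located replicas, in case it is useful to anyone else. It is only a sketch: the Solr URL and collection name are placeholders, and it assumes the CLUSTERSTATUS response layout (cluster -> collections -> shards -> replicas, with node_name on each replica), so double-check it against your own output.

# Hypothetical sketch: flag shards that have more than one replica on the
# same node, using the Collections API CLUSTERSTATUS action. The Solr URL
# and collection name are placeholders.
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR = "http://localhost:8983/solr"    # any node in the cluster (assumption)
COLLECTION = "search_en"               # hypothetical collection name

url = SOLR + "/admin/collections?" + urlencode(
    {"action": "CLUSTERSTATUS", "collection": COLLECTION, "wt": "json"})
with urlopen(url) as resp:
    status = json.load(resp)

shards = status["cluster"]["collections"][COLLECTION]["shards"]
for shard_name, shard in shards.items():
    # Count how many replicas of this shard live on each node.
    per_node = Counter(r["node_name"] for r in shard["replicas"].values())
    for node, count in per_node.items():
        if count > 1:
            print(f"{shard_name}: {count} replicas on {node}")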