6.3.0. No idea how it happened, but I ended up with two replicas on the same host after one host went down.
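For reference, here is a rough sketch of the CREATE call I think would pin each shard's replicas to separate nodes, using the rule-based replica placement described in the 6.x reference guide, alongside the settings we already use (maxShardsPerNode=1, autoAddReplicas=false). The Solr URL, collection name, and config name below are placeholders, and I have not verified the rule syntax against 6.3.0, so treat it as a sketch rather than a recipe:

# Hypothetical sketch: create a 4x4 collection with a placement rule that
# refuses to put two replicas of the same shard on one node. The base URL,
# collection name, and config name are made up; the rule syntax follows the
# "Rule-based Replica Placement" page of the Solr 6.x Reference Guide.
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR = "http://localhost:8983/solr"    # any node in the cluster (assumption)

params = {
    "action": "CREATE",
    "name": "search_en",                # hypothetical collection name
    "collection.configName": "search_conf",
    "numShards": 4,
    "replicationFactor": 4,
    "maxShardsPerNode": 1,              # same setting as in this thread
    "autoAddReplicas": "false",         # same setting as in this thread
    # For any shard, allow fewer than 2 replicas per node (i.e. at most one).
    # With one JVM per EC2 instance, node and host are effectively the same.
    "rule": "shard:*,replica:<2,node:*",
    "wt": "json",
}

with urlopen(SOLR + "/admin/collections?" + urlencode(params)) as resp:
    print(resp.read().decode("utf-8"))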
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Mar 18, 2017, at 8:35 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> Hmmm, I'm totally mystified about how Solr is "creating a new replica when one host is down". Are you saying this is happening automagically? You're right, the autoAddReplicas bit is HDFS-only, so having replicas just show up is completely weird. In days past, when a replica was discovered on disk when Solr started up, it would reconstruct itself _even if the collection had been deleted_. They would reappear in clusterstate.json, _not_ the individual state.json files under collections in ZK. Is this at all possible?
>
> What version of Solr are you using anyway? The reconstruction I mentioned above is 4x, IIRC.
>
> Best,
> Erick
>
> On Sat, Mar 18, 2017 at 5:45 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>> Thanks. This is a very CPU-heavy workload, with ngram fields and very long queries. 16.7 million docs.
>>
>> The whole cascading-failure thing in search engines is hard. The first time I hit this was at Infoseek, over twenty years ago.
>>
>>> On Mar 18, 2017, at 12:46 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>> bug #2: Solr shouldn't be adding replicas by itself unless you specified autoAddReplicas=true when you created the collection. It defaults to "false". So I'm not sure what's going on here.
>>
>> "autoAddReplicas":"false",
>>
>> in both collections. I thought that only worked with HDFS anyway.
>>
>>> bug #3: The internal load balancers are round-robin, so this is expected. Not optimal, I'll grant, but expected.
>>
>> Right. Still a bug. It should be round-robin on instances, not cores.
>>
>>> bug #4: What shard placement rules are you using? There are a series of rules for replica placement, and one of the criteria (IIRC) is exactly to try to distribute replicas to different hosts. Although there was some glitchiness about whether two JVMs on the same _host_ were considered "the same host" or not.
>>
>> Separate Amazon EC2 instances, one JVM per instance, no rules other than the default.
>>
>> "maxShardsPerNode":"1",
>>
>>> bug #1 has been more or less of a pain for quite a while; work is ongoing there.
>>
>> Glad to share our logs.
>>
>> wunder
>>
>>> FWIW,
>>> Erick
>>>
>>> On Fri, Mar 17, 2017 at 5:40 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>>> I'm running a 4x4 cluster (4 shards, replication factor 4) on 16 hosts. I shut down Solr on one host because it got into some kind of bad, can't-recover state where it was causing timeouts across the whole cluster (bug #1).
>>>>
>>>> I ran a load benchmark near the capacity of the cluster. This had run fine in test; this was the prod cluster.
>>>>
>>>> SolrCloud added a replica to replace the down node. The node with two cores got double the traffic and started slowly flapping in and out of service. The 95th percentile response time spiked from 3 seconds to 100 seconds. At some point, another replica was made, with two replicas from the same shard on the same instance. Naturally, that was overloaded, and I killed the benchmark out of charity.
>>>>
>>>> Bug #2 is creating a new replica when one host is down. This should be an option and default to "false", because it causes the cascade.
>>>>
>>>> Bug #3 is sending equal traffic to each core without considering the host. Each host should get equal traffic, not each core.
>>>>
>>>> Bug #4 is putting two replicas from the same shard on one instance. That is just asking for trouble.
>>>>
>>>> When it works, this cluster is awesome.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
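P.S. Here is the small check I am using to spot co-located replicas, in case it is useful to anyone else. It is only a sketch: the Solr URL and collection name are placeholders, and it assumes the CLUSTERSTATUS response layout (cluster -> collections -> shards -> replicas, with node_name on each replica), so double-check it against your own output.

# Hypothetical sketch: flag shards that have more than one replica on the
# same node, using the Collections API CLUSTERSTATUS action. The Solr URL
# and collection name are placeholders.
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR = "http://localhost:8983/solr"    # any node in the cluster (assumption)
COLLECTION = "search_en"               # hypothetical collection name

url = SOLR + "/admin/collections?" + urlencode(
    {"action": "CLUSTERSTATUS", "collection": COLLECTION, "wt": "json"})
with urlopen(url) as resp:
    status = json.load(resp)

shards = status["cluster"]["collections"][COLLECTION]["shards"]
for shard_name, shard in shards.items():
    # Count how many replicas of this shard live on each node.
    per_node = Counter(r["node_name"] for r in shard["replicas"].values())
    for node, count in per_node.items():
        if count > 1:
            print(f"{shard_name}: {count} replicas on {node}")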