[ https://issues.apache.org/jira/browse/SOLR-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000279#comment-17000279 ]
Andrzej Bialecki commented on SOLR-14123:
-----------------------------------------

Also, {{autoAddReplicas}} is a reactive measure that responds to {{nodeLost}} events; it is not a proactive measure that monitors the number of replicas.

In some situations autoscaling events may be lost, or their execution interrupted, without the failure triggering a retry. For example, if a lost replica is being re-created on a node that itself goes down during replica creation, the replica may stay lost: it is not yet visible on the target node (it is still being created), but by this point the autoscaling trigger considers the original {{nodeLost}} event "handled", so it does not re-create the event.

See SOLR-12749, and the description in SOLR-13828 (it was only partially fixed).

> autoAddReplicas is not reliable when multiple nodes go down.
> ------------------------------------------------------------
>
> Key: SOLR-14123
> URL: https://issues.apache.org/jira/browse/SOLR-14123
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: AutoScaling
> Affects Versions: 8.3
> Reporter: David Hunt
> Priority: Major
> Labels: autoscale
>
> I started noticing problems in our production environment with indexing being
> blocked due to a minimum replication factor not being met. We have
> autoAddReplicas triggers in place to add replicas when nodes are lost, but they
> do not seem to add all of the replicas that were lost with those nodes.
> I’ve been able to reproduce this behavior consistently in a development
> environment.
> Repro:
> # Set up a 10-node SolrCloud cluster.
> # Create an autoAddReplicas trigger on nodeLost with waitFor set to 10
> minutes.
> # Create 15 collections with 2 shards and 4 replicas.
> # Kill 3 Solr nodes.
> # 15 minutes later kill 1 more Solr node.
> Results:
> Monitor your shards/replicas. You’ll see some replicas added to make up for
> the lost replicas but not all.
> An hour later many shards are still missing replicas.
> Expected:
> All lost replicas should be added on the 6 remaining healthy nodes.
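
To illustrate the distinction drawn in the comment above between reacting to {{nodeLost}} events and proactively monitoring the number of replicas, here is a minimal Python sketch of the proactive side only. It is not how Solr's autoscaling framework is implemented. The input dictionary is assumed to follow the shape of the `cluster` object in a Collections API CLUSTERSTATUS response (`live_nodes`, `collections`, `shards`, `replicas`, `state`, `node_name`); the function name `find_under_replicated`, the `expected` parameter, and the sample data are hypothetical.

```python
from typing import List, Tuple


def find_under_replicated(cluster: dict, expected: int) -> List[Tuple[str, str, int]]:
    """Return (collection, shard, missing_count) for every shard that has
    fewer than `expected` active replicas hosted on live nodes.

    This inspects current state rather than past events, so a replica whose
    re-creation was interrupted still shows up as missing on the next check.
    """
    live = set(cluster.get("live_nodes", []))
    short = []
    for coll_name, coll in sorted(cluster.get("collections", {}).items()):
        for shard_name, shard in sorted(coll.get("shards", {}).items()):
            # Count replicas that are both marked active and on a live node.
            usable = sum(
                1
                for replica in shard.get("replicas", {}).values()
                if replica.get("state") == "active"
                and replica.get("node_name") in live
            )
            if usable < expected:
                short.append((coll_name, shard_name, expected - usable))
    return short


# Hypothetical sample state: node3 has gone down, so its replica is unusable.
sample = {
    "live_nodes": ["node1:8983_solr", "node2:8983_solr"],
    "collections": {
        "c1": {
            "shards": {
                "shard1": {
                    "replicas": {
                        "core_node1": {"state": "active", "node_name": "node1:8983_solr"},
                        "core_node2": {"state": "active", "node_name": "node3:8983_solr"},
                    }
                },
                "shard2": {
                    "replicas": {
                        "core_node3": {"state": "active", "node_name": "node1:8983_solr"},
                        "core_node4": {"state": "active", "node_name": "node2:8983_solr"},
                    }
                },
            }
        }
    },
}

if __name__ == "__main__":
    print(find_under_replicated(sample, expected=2))
```

A check like this would need to run periodically, with `expected` set to the replica count each shard was created with. Because it compares desired and actual state on every run, it does not depend on any single event having been handled to completion.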