[ https://issues.apache.org/jira/browse/SOLR-15109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Smiley reassigned SOLR-15109:
-----------------------------------

    Component/s: SolrCloud
       Assignee: David Smiley
        Summary: Optimize shard splitByPrefix logic to reduce number of splits required  (was: Optimize splitByPrefix logic to reduce number of splits required)

> Optimize shard splitByPrefix logic to reduce number of splits required
> ----------------------------------------------------------------------
>
>                 Key: SOLR-15109
>                 URL: https://issues.apache.org/jira/browse/SOLR-15109
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: SolrCloud
>            Reporter: Megan Carey
>            Assignee: David Smiley
>            Priority: Major
>         Attachments: Split 1 (1).png, Split 2 (1).png, Split 3 (1).png
>
>
> The goal of SplitByPrefix logic is to identify "buckets" within a shard that
> contain documents that should be co-located (according to their doc prefix),
> and split such that those buckets are preserved. One issue that we have found
> with splitByPrefix in practice is that it often takes several splits to
> isolate a particularly large bucket within the hash range.
> [~dsmiley] came up with a simple optimization that will reduce the number of
> splits needed to isolate such a bucket:
> {quote}Loop over all RangeCounts... does it intersect the middle third of the
> input? If not, move on. If so, track the biggest. When this loop finishes,
> you will have the biggest that also intersects the middle third. Then simply
> choose the side of this biggest RangeCount that is closest to the middle of
> the input range.{quote}
> This should be clearer with the following diagrams:
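
As a rough illustration of the optimization quoted in the issue description, the loop might look like the Python sketch below. It is not Solr's implementation: the `RangeCount` class here is a hypothetical stand-in (an inclusive hash range plus a document count), `choose_split_point` is an invented name, and the exact definition of the "middle third" (integer division, inclusive bounds) is an assumption. The sketch only returns the chosen bucket boundary; how that boundary is turned into two sub-ranges is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RangeCount:
    """Hypothetical stand-in for a prefix bucket: inclusive hash range plus doc count."""
    lo: int
    hi: int
    count: int


def choose_split_point(input_lo: int, input_hi: int,
                       buckets: List[RangeCount]) -> Optional[int]:
    """Return the bucket boundary to split at, or None if no bucket
    intersects the middle third of the input range."""
    # Assumed definition of the middle third: drop one third from each end.
    third = (input_hi - input_lo) // 3
    mid_lo, mid_hi = input_lo + third, input_hi - third
    middle = (input_lo + input_hi) / 2

    biggest = None
    for b in buckets:
        # Skip buckets that lie entirely outside the middle third.
        if b.hi < mid_lo or b.lo > mid_hi:
            continue
        # Track the largest bucket seen so far among the intersecting ones.
        if biggest is None or b.count > biggest.count:
            biggest = b

    if biggest is None:
        return None

    # Choose whichever edge of the biggest bucket is closest to the middle
    # of the input range (ties go to the lower edge).
    return min((biggest.lo, biggest.hi), key=lambda edge: abs(edge - middle))


if __name__ == "__main__":
    buckets = [RangeCount(0, 49, 10), RangeCount(50, 179, 500), RangeCount(180, 299, 20)]
    print(choose_split_point(0, 299, buckets))  # prints 179
```

In the example, the 500-document bucket covering 50..179 is the largest one touching the middle third (99..200), and its upper edge, 179, is nearer the midpoint 149.5 than its lower edge, so the first split already lands on a boundary of the large bucket.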