Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

via GitHub Fri, 25 Oct 2024 10:20:51 -0700


HoustonPutman commented on code in PR #13914:
URL: https://github.com/apache/lucene/pull/13914#discussion_r1810967900



##########
lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java:
##########
@@ -202,66 +208,83 @@ public SegmentOutput(int hitsLength) {
    *     is used to compute the equi-weight per bin.
    */
   public static List<DynamicRangeInfo> computeDynamicNumericRanges(
-      long[] values, long[] weights, int len, long totalWeight, int topN) {
+      long[] values, long[] weights, int len, long totalValue, long 
totalWeight, int topN) {
     assert values.length == weights.length && len <= values.length && len >= 0;
     assert topN >= 0;
     List<DynamicRangeInfo> dynamicRangeResult = new ArrayList<>();
     if (len == 0 || topN == 0) {
       return dynamicRangeResult;
     }
 
-    new InPlaceMergeSorter() {
-      @Override
-      protected int compare(int index1, int index2) {
-        int cmp = Long.compare(values[index1], values[index2]);
-        if (cmp == 0) {
-          // If the values are equal, sort based on the weights.
-          // Any weight order is correct as long as it's deterministic.
-          return Long.compare(weights[index1], weights[index2]);
-        }
-        return cmp;
-      }
+    double rangeWeightTarget = (double) totalWeight / topN;
+    double[] kWeights = new double[topN];
+    for (int i = 0; i < topN; i++) {
+      kWeights[i] = (i == 0 ? 0 : kWeights[i - 1]) + rangeWeightTarget;

Review Comment:
   Wow yeah, both are better (though I like the first). This is the beauty of 
PR reviews haha. When you are 500 lines into a change, who knows what dumb 
things you will write...



##########
lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java:
##########
@@ -0,0 +1,407 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.SplittableRandom;
+
+/**
+ * Adaptive selection algorithm based on the introspective quick select 
algorithm. The quick select
+ * algorithm uses an interpolation variant of Tukey's ninther 
median-of-medians for pivot, and
+ * Bentley-McIlroy 3-way partitioning. For the introspective protection, it 
shuffles the sub-range
+ * if the max recursive depth is exceeded.
+ *
+ * <p>This selection algorithm is fast on most data shapes, especially on 
nearly sorted data, or
+ * when k is close to the boundaries. It runs in linear time on average.
+ *
+ * @lucene.internal
+ */
+public abstract class WeightedSelector {
+
+  // This selector is used repeatedly by the radix selector for sub-ranges of 
less than
+  // 100 entries. This means this selector is also optimized to be fast on 
small ranges.
+  // It uses the variant of medians-of-medians and 3-way partitioning, and 
finishes the
+  // last tiny range (3 entries or less) with a very specialized sort.
+
+  private SplittableRandom random;
+
+  protected abstract long getWeight(int i);
+
+  protected abstract long getValue(int i);
+
+  public final WeightRangeInfo[] select(

Review Comment:
   Absolutely. Was going to go through and add docs, just wanted to make sure 
it was a good direction to go in first. Probably worth doing the benchmarking 
first 🥹



##########
lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java:
##########
@@ -0,0 +1,407 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.SplittableRandom;
+
+/**
+ * Adaptive selection algorithm based on the introspective quick select 
algorithm. The quick select
+ * algorithm uses an interpolation variant of Tukey's ninther 
median-of-medians for pivot, and
+ * Bentley-McIlroy 3-way partitioning. For the introspective protection, it 
shuffles the sub-range
+ * if the max recursive depth is exceeded.
+ *
+ * <p>This selection algorithm is fast on most data shapes, especially on 
nearly sorted data, or
+ * when k is close to the boundaries. It runs in linear time on average.
+ *
+ * @lucene.internal
+ */
+public abstract class WeightedSelector {
+
+  // This selector is used repeatedly by the radix selector for sub-ranges of 
less than
+  // 100 entries. This means this selector is also optimized to be fast on 
small ranges.
+  // It uses the variant of medians-of-medians and 3-way partitioning, and 
finishes the
+  // last tiny range (3 entries or less) with a very specialized sort.
+
+  private SplittableRandom random;
+
+  protected abstract long getWeight(int i);
+
+  protected abstract long getValue(int i);
+
+  public final WeightRangeInfo[] select(
+      int from,
+      int to,
+      long rangeTotalValue,
+      long beforeTotalValue,
+      long rangeWeight,
+      long beforeWeight,
+      double[] kWeights) {
+    WeightRangeInfo[] kIndexResults = new WeightRangeInfo[kWeights.length];
+    Arrays.fill(kIndexResults, new WeightRangeInfo(-1, 0, 0));
+    checkArgs(rangeWeight, beforeWeight, kWeights);
+    select(
+        from,
+        to,
+        rangeTotalValue,
+        beforeTotalValue,
+        rangeWeight,
+        beforeWeight,
+        kWeights,
+        0,
+        kWeights.length,
+        kIndexResults,
+        2 * MathUtil.log(to - from, 2));
+    return kIndexResults;
+  }
+
+  void checkArgs(long rangeWeight, long beforeWeight, double[] kWeights) {
+    if (kWeights.length < 1) {
+      throw new IllegalArgumentException("There must be at least one k to 
select, none given");
+    }
+    Arrays.sort(kWeights);
+    if (kWeights[0] < beforeWeight) {
+      throw new IllegalArgumentException("All kWeights must be >= 
beforeWeight");
+    }
+    if (kWeights[kWeights.length - 1] > beforeWeight + rangeWeight) {
+      throw new IllegalArgumentException("All kWeights must be < beforeWeight 
+ rangeWeight");
+    }
+  }
+
+  // Visible for testing.
+  void select(
+      int from,
+      int to,
+      long rangeTotalValue,
+      long beforeTotalValue,
+      long rangeWeight,
+      long beforeWeight,
+      double[] kWeights,
+      int kFrom,
+      int kTo,
+      WeightRangeInfo[] kIndexResults,
+      int maxDepth) {
+
+    // This code is inspired from IntroSorter#sort, adapted to loop on a 
single partition.
+
+    // For efficiency, we must enter the loop with at least 4 entries to be 
able to skip
+    // some boundary tests during the 3-way partitioning.
+    int size;
+    if ((size = to - from) > 3) {
+
+      if (--maxDepth == -1) {
+        // Max recursion depth exceeded: shuffle (only once) and continue.
+        shuffle(from, to);
+      }
+
+      // Pivot selection based on medians.
+      int last = to - 1;
+      int mid = (from + last) >>> 1;
+      int pivot;
+      if (size <= IntroSorter.SINGLE_MEDIAN_THRESHOLD) {
+        // Select the pivot with a single median around the middle element.
+        // Do not take the median between [from, mid, last] because it hurts 
performance
+        // if the order is descending in conjunction with the 3-way 
partitioning.
+        int range = size >> 2;
+        pivot = median(mid - range, mid, mid + range);
+      } else {
+        // Select the pivot with a variant of the Tukey's ninther median of 
medians.
+        // If k is close to the boundaries, select either the lowest or 
highest median (this variant
+        // is inspired from the interpolation search).
+        int range = size >> 3;
+        int doubleRange = range << 1;
+        int medianFirst = median(from, from + range, from + doubleRange);
+        int medianMiddle = median(mid - range, mid, mid + range);
+        int medianLast = median(last - doubleRange, last - range, last);
+
+        double avgWeight = ((double) rangeWeight) / (to - from);
+        double middleWeight = kWeights[(kFrom + kTo - 1) >> 1];
+        // Approximate the k we are trying to find by assuming an equal weight 
amongst values
+        int middleK = from + (int) ((middleWeight - beforeWeight) / avgWeight);
+        if (middleK - from < range) {
+          // k is close to 'from': select the lowest median.
+          pivot = min(medianFirst, medianMiddle, medianLast);
+        } else if (to - middleK <= range) {
+          // k is close to 'to': select the highest median.
+          pivot = max(medianFirst, medianMiddle, medianLast);
+        } else {
+          // Otherwise select the median of medians.
+          pivot = median(medianFirst, medianMiddle, medianLast);
+        }
+      }
+
+      // Bentley-McIlroy 3-way partitioning.
+      setPivot(pivot);
+      swap(from, pivot);
+      int i = from;
+      int j = to;
+      int p = from + 1;
+      int q = last;
+      long leftTotalValue = 0;
+      long leftWeight = 0;
+      long rightTotalValue = 0;
+      long rightWeight = 0;
+      while (true) {
+        int leftCmp, rightCmp;
+        while ((leftCmp = comparePivot(++i)) > 0) {
+          leftTotalValue += getValue(i);
+          leftWeight += getWeight(i);
+        }
+        while ((rightCmp = comparePivot(--j)) < 0) {
+          rightTotalValue += getValue(j);
+          rightWeight += getWeight(j);
+        }
+        if (i >= j) {
+          if (i == j && rightCmp == 0) {
+            swap(i, p);
+          }
+          break;
+        }
+        swap(i, j);
+        if (rightCmp == 0) {
+          swap(i, p++);
+        } else {
+          leftTotalValue += getValue(i);
+          leftWeight += getWeight(i);
+        }
+        if (leftCmp == 0) {
+          swap(j, q--);
+        } else {
+          rightTotalValue += getValue(j);
+          rightWeight += getWeight(j);
+        }
+      }
+      i = j + 1;
+      for (int l = from; l < p; ) {
+        swap(l++, j--);
+      }
+      for (int l = last; l > q; ) {
+        swap(l--, i++);
+      }
+      long leftWeightEnd = beforeWeight + leftWeight;
+      long rightWeightStart = beforeWeight + rangeWeight - rightWeight;
+
+      // Select the K weight values contained in the bottom and top partitions.
+      int topKFrom = kTo;
+      int bottomKTo = kFrom;
+      for (int ki = kTo - 1; ki >= kFrom; ki--) {
+        if (kWeights[ki] >= rightWeightStart) {
+          topKFrom = ki;
+        }
+        if (kWeights[ki] <= leftWeightEnd) {
+          bottomKTo = ki + 1;
+          break;
+        }
+      }
+      // Recursively select the relevant k-values from the bottom group, if 
there are any k-values
+      // to select there
+      if (bottomKTo > kFrom) {
+        select(
+            from,
+            j + 1,
+            leftTotalValue,
+            beforeTotalValue,
+            leftWeight,
+            beforeWeight,
+            kWeights,
+            kFrom,
+            bottomKTo,
+            kIndexResults,
+            maxDepth);
+      }
+      // Recursively select the relevant k-values from the top group, if there 
are any k-values to
+      // select there
+      if (topKFrom < kTo) {
+        select(
+            i,
+            to,
+            rightTotalValue,
+            beforeTotalValue + rangeTotalValue - rightTotalValue,
+            rightWeight,
+            beforeWeight + rangeWeight - rightWeight,
+            kWeights,
+            topKFrom,
+            kTo,
+            kIndexResults,
+            maxDepth);
+      }
+
+      // Choose the k result indexes for this partition
+      if (bottomKTo < topKFrom) {
+        findKIndexes(
+            j + 1,
+            i,
+            beforeTotalValue + leftTotalValue,
+            beforeWeight + leftWeight,
+            bottomKTo,
+            topKFrom,
+            kWeights,
+            kIndexResults);
+      }
+    }
+
+    // Sort the final tiny range (3 entries or less) with a very specialized 
sort.
+    switch (size) {
+      case 1:
+        kIndexResults[kTo - 1] =
+            new WeightRangeInfo(
+                from, beforeTotalValue + getValue(from), beforeWeight + 
getWeight(from));
+        break;
+      case 2:
+        if (compare(from, from + 1) > 0) {
+          swap(from, from + 1);
+        }
+        findKIndexes(
+            from, from + 2, beforeTotalValue, beforeWeight, kFrom, kTo, 
kWeights, kIndexResults);
+        break;
+      case 3:
+        sort3(from);
+        findKIndexes(
+            from, from + 3, beforeTotalValue, beforeWeight, kFrom, kTo, 
kWeights, kIndexResults);
+        break;
+    }
+  }
+
+  private void findKIndexes(
+      int from,
+      int to,
+      long beforeTotalValue,
+      long beforeWeight,
+      int kFrom,
+      int kTo,
+      double[] kWeights,
+      WeightRangeInfo[] kIndexResults) {
+    long runningWeight = beforeWeight;
+    long runningTotalValue = beforeTotalValue;
+    int kIdx = kFrom;
+    for (int listIdx = from; listIdx < to && kIdx < kTo; listIdx++) {
+      runningWeight += getWeight(listIdx);
+      runningTotalValue += getValue(listIdx);
+      // Skip ahead in the weight list if the same value is used for multiple 
weights, we will only
+      // record a result index for the last weight that matches it.
+      while (++kIdx < kTo && kWeights[kIdx] <= runningWeight) {}
+      if (kWeights[--kIdx] <= runningWeight) {
+        kIndexResults[kIdx] = new WeightRangeInfo(listIdx, runningTotalValue, 
runningWeight);
+        // Now that we have recorded the resultIndex for this weight, go to 
the next weight
+        kIdx++;
+      }
+    }
+  }
+
+  /** Returns the index of the min element among three elements at provided 
indices. */
+  private int min(int i, int j, int k) {
+    if (compare(i, j) <= 0) {
+      return compare(i, k) <= 0 ? i : k;
+    }
+    return compare(j, k) <= 0 ? j : k;
+  }
+
+  /** Returns the index of the max element among three elements at provided 
indices. */
+  private int max(int i, int j, int k) {
+    if (compare(i, j) <= 0) {
+      return compare(j, k) < 0 ? k : j;
+    }
+    return compare(i, k) < 0 ? k : i;
+  }
+
+  /** Copy of {@code IntroSorter#median}. */
+  private int median(int i, int j, int k) {
+    if (compare(i, j) < 0) {
+      if (compare(j, k) <= 0) {
+        return j;
+      }
+      return compare(i, k) < 0 ? k : i;
+    }
+    if (compare(j, k) >= 0) {
+      return j;
+    }
+    return compare(i, k) < 0 ? i : k;
+  }
+
+  /**
+   * Sorts 3 entries starting at from (inclusive). This specialized method is 
more efficient than
+   * calling {@link Sorter#insertionSort(int, int)}.
+   */
+  private void sort3(int from) {
+    final int mid = from + 1;
+    final int last = from + 2;
+    if (compare(from, mid) <= 0) {
+      if (compare(mid, last) > 0) {
+        swap(mid, last);
+        if (compare(from, mid) > 0) {
+          swap(from, mid);
+        }
+      }
+    } else if (compare(mid, last) >= 0) {
+      swap(from, last);
+    } else {
+      swap(from, mid);
+      if (compare(mid, last) > 0) {
+        swap(mid, last);
+      }
+    }
+  }
+
+  /**
+   * Shuffles the entries between from (inclusive) and to (exclusive) with 
Durstenfeld's algorithm.
+   */
+  private void shuffle(int from, int to) {
+    if (this.random == null) {
+      this.random = new SplittableRandom();
+    }
+    SplittableRandom random = this.random;
+    for (int i = to - 1; i > from; i--) {
+      swap(i, random.nextInt(from, i + 1));

Review Comment:
   I haven't even looked at this method. It was straight copied from 
`IntroSelector`.
   
   After doing some research, this seems to be the right way of doing it 
according to the algorithm they specified: 
https://en.wikipedia.org/wiki/Fisher–Yates_shuffle#The_modern_algorithm



##########
lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java:
##########
@@ -0,0 +1,407 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.SplittableRandom;
+
+/**
+ * Adaptive selection algorithm based on the introspective quick select 
algorithm. The quick select
+ * algorithm uses an interpolation variant of Tukey's ninther 
median-of-medians for pivot, and
+ * Bentley-McIlroy 3-way partitioning. For the introspective protection, it 
shuffles the sub-range
+ * if the max recursive depth is exceeded.
+ *
+ * <p>This selection algorithm is fast on most data shapes, especially on 
nearly sorted data, or
+ * when k is close to the boundaries. It runs in linear time on average.
+ *
+ * @lucene.internal
+ */
+public abstract class WeightedSelector {
+
+  // This selector is used repeatedly by the radix selector for sub-ranges of 
less than
+  // 100 entries. This means this selector is also optimized to be fast on 
small ranges.
+  // It uses the variant of medians-of-medians and 3-way partitioning, and 
finishes the
+  // last tiny range (3 entries or less) with a very specialized sort.
+
+  private SplittableRandom random;
+
+  protected abstract long getWeight(int i);
+
+  protected abstract long getValue(int i);
+
+  public final WeightRangeInfo[] select(
+      int from,
+      int to,
+      long rangeTotalValue,
+      long beforeTotalValue,
+      long rangeWeight,
+      long beforeWeight,
+      double[] kWeights) {
+    WeightRangeInfo[] kIndexResults = new WeightRangeInfo[kWeights.length];
+    Arrays.fill(kIndexResults, new WeightRangeInfo(-1, 0, 0));
+    checkArgs(rangeWeight, beforeWeight, kWeights);
+    select(
+        from,
+        to,
+        rangeTotalValue,
+        beforeTotalValue,
+        rangeWeight,
+        beforeWeight,
+        kWeights,
+        0,
+        kWeights.length,
+        kIndexResults,
+        2 * MathUtil.log(to - from, 2));
+    return kIndexResults;
+  }
+
+  void checkArgs(long rangeWeight, long beforeWeight, double[] kWeights) {
+    if (kWeights.length < 1) {
+      throw new IllegalArgumentException("There must be at least one k to 
select, none given");
+    }
+    Arrays.sort(kWeights);
+    if (kWeights[0] < beforeWeight) {
+      throw new IllegalArgumentException("All kWeights must be >= 
beforeWeight");
+    }
+    if (kWeights[kWeights.length - 1] > beforeWeight + rangeWeight) {
+      throw new IllegalArgumentException("All kWeights must be < beforeWeight 
+ rangeWeight");
+    }
+  }
+
+  // Visible for testing.
+  void select(
+      int from,
+      int to,
+      long rangeTotalValue,
+      long beforeTotalValue,
+      long rangeWeight,
+      long beforeWeight,
+      double[] kWeights,
+      int kFrom,
+      int kTo,
+      WeightRangeInfo[] kIndexResults,
+      int maxDepth) {
+
+    // This code is inspired from IntroSorter#sort, adapted to loop on a 
single partition.
+
+    // For efficiency, we must enter the loop with at least 4 entries to be 
able to skip
+    // some boundary tests during the 3-way partitioning.
+    int size;
+    if ((size = to - from) > 3) {
+
+      if (--maxDepth == -1) {
+        // Max recursion depth exceeded: shuffle (only once) and continue.

Review Comment:
   This is from `IntroSelector`, but basically I think it's saying if I've done 
enough recursions in QuickSelect, that means that our data has a really bad 
distribution? So just randomize it a bit and continue. I don't have an opinion 
as I haven't studied it, but hopefully there was research put into the idea?



##########
lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java:
##########
@@ -0,0 +1,407 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.SplittableRandom;
+
+/**
+ * Adaptive selection algorithm based on the introspective quick select 
algorithm. The quick select
+ * algorithm uses an interpolation variant of Tukey's ninther 
median-of-medians for pivot, and
+ * Bentley-McIlroy 3-way partitioning. For the introspective protection, it 
shuffles the sub-range
+ * if the max recursive depth is exceeded.
+ *
+ * <p>This selection algorithm is fast on most data shapes, especially on 
nearly sorted data, or
+ * when k is close to the boundaries. It runs in linear time on average.
+ *
+ * @lucene.internal
+ */
+public abstract class WeightedSelector {
+
+  // This selector is used repeatedly by the radix selector for sub-ranges of 
less than
+  // 100 entries. This means this selector is also optimized to be fast on 
small ranges.
+  // It uses the variant of medians-of-medians and 3-way partitioning, and 
finishes the
+  // last tiny range (3 entries or less) with a very specialized sort.
+
+  private SplittableRandom random;
+
+  protected abstract long getWeight(int i);
+
+  protected abstract long getValue(int i);
+
+  public final WeightRangeInfo[] select(
+      int from,
+      int to,
+      long rangeTotalValue,
+      long beforeTotalValue,
+      long rangeWeight,
+      long beforeWeight,
+      double[] kWeights) {
+    WeightRangeInfo[] kIndexResults = new WeightRangeInfo[kWeights.length];
+    Arrays.fill(kIndexResults, new WeightRangeInfo(-1, 0, 0));
+    checkArgs(rangeWeight, beforeWeight, kWeights);
+    select(
+        from,
+        to,
+        rangeTotalValue,
+        beforeTotalValue,
+        rangeWeight,
+        beforeWeight,
+        kWeights,
+        0,
+        kWeights.length,
+        kIndexResults,
+        2 * MathUtil.log(to - from, 2));
+    return kIndexResults;
+  }
+
+  void checkArgs(long rangeWeight, long beforeWeight, double[] kWeights) {
+    if (kWeights.length < 1) {
+      throw new IllegalArgumentException("There must be at least one k to 
select, none given");
+    }
+    Arrays.sort(kWeights);
+    if (kWeights[0] < beforeWeight) {
+      throw new IllegalArgumentException("All kWeights must be >= 
beforeWeight");
+    }
+    if (kWeights[kWeights.length - 1] > beforeWeight + rangeWeight) {
+      throw new IllegalArgumentException("All kWeights must be < beforeWeight 
+ rangeWeight");
+    }
+  }
+
+  // Visible for testing.
+  void select(

Review Comment:
   Yeah, the 3-way partitioning was also quite confusing to me until I looked 
it up. And even then, the code is still quite hard to understand. I copied the 
default implementation from `IntroSelector`, then modified it to support 
multi-select, and select by cumulative weight, not by ordinal. So a lot of the 
complexity/confusion I can't necessarily speak to. Maybe this would be clearer 
if in the Javadocs of the class, it called out `IntroSelector` as the base 
algorithm?



##########
lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java:
##########
@@ -0,0 +1,407 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.SplittableRandom;
+
+/**
+ * Adaptive selection algorithm based on the introspective quick select 
algorithm. The quick select
+ * algorithm uses an interpolation variant of Tukey's ninther 
median-of-medians for pivot, and
+ * Bentley-McIlroy 3-way partitioning. For the introspective protection, it 
shuffles the sub-range
+ * if the max recursive depth is exceeded.
+ *
+ * <p>This selection algorithm is fast on most data shapes, especially on 
nearly sorted data, or
+ * when k is close to the boundaries. It runs in linear time on average.
+ *
+ * @lucene.internal
+ */
+public abstract class WeightedSelector {
+
+  // This selector is used repeatedly by the radix selector for sub-ranges of 
less than
+  // 100 entries. This means this selector is also optimized to be fast on 
small ranges.
+  // It uses the variant of medians-of-medians and 3-way partitioning, and 
finishes the
+  // last tiny range (3 entries or less) with a very specialized sort.
+
+  private SplittableRandom random;
+
+  protected abstract long getWeight(int i);
+
+  protected abstract long getValue(int i);
+
+  public final WeightRangeInfo[] select(
+      int from,
+      int to,
+      long rangeTotalValue,
+      long beforeTotalValue,
+      long rangeWeight,
+      long beforeWeight,
+      double[] kWeights) {

Review Comment:
   That's a very fair point. I struggled naming this. Basically the `k` prefix 
is for choosing where to select. So `kWeights` is the weight-cutoffs that you 
want to select. if you have a total weight of 100 and want to group into 5, 
then `kWeights` would be `[20,40,60,80,100]`.  Very open to better naming 
anywhere!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

Reply via email to