Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

via GitHub Wed, 12 Feb 2025 09:36:38 -0800


HoustonPutman commented on code in PR #13914:
URL: https://github.com/apache/lucene/pull/13914#discussion_r1953123437



##########
lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java:
##########
@@ -202,66 +208,83 @@ public SegmentOutput(int hitsLength) {
    *     is used to compute the equi-weight per bin.
    */
   public static List<DynamicRangeInfo> computeDynamicNumericRanges(
-      long[] values, long[] weights, int len, long totalWeight, int topN) {
+      long[] values, long[] weights, int len, long totalValue, long 
totalWeight, int topN) {
     assert values.length == weights.length && len <= values.length && len >= 0;
     assert topN >= 0;
     List<DynamicRangeInfo> dynamicRangeResult = new ArrayList<>();
     if (len == 0 || topN == 0) {
       return dynamicRangeResult;
     }
 
-    new InPlaceMergeSorter() {
-      @Override
-      protected int compare(int index1, int index2) {
-        int cmp = Long.compare(values[index1], values[index2]);
-        if (cmp == 0) {
-          // If the values are equal, sort based on the weights.
-          // Any weight order is correct as long as it's deterministic.
-          return Long.compare(weights[index1], weights[index2]);
-        }
-        return cmp;
-      }
+    double rangeWeightTarget = (double) totalWeight / topN;
+    double[] kWeights = new double[topN];
+    for (int i = 0; i < topN; i++) {
+      kWeights[i] = (i == 0 ? 0 : kWeights[i - 1]) + rangeWeightTarget;
+    }
 
-      @Override
-      protected void swap(int index1, int index2) {
-        long tmp = values[index1];
-        values[index1] = values[index2];
-        values[index2] = tmp;
-        tmp = weights[index1];
-        weights[index1] = weights[index2];
-        weights[index2] = tmp;
-      }
-    }.sort(0, len);
+    WeightedSelector.WeightRangeInfo[] kIndexResults =
+        new WeightedSelector() {
+          private long pivotValue;
+          private long pivotWeight;
 
-    long accuWeight = 0;
-    long valueSum = 0;
-    int count = 0;
-    int minIdx = 0;
+          @Override
+          protected long getWeight(int i) {
+            return weights[i];
+          }
 
-    double rangeWeightTarget = (double) totalWeight / Math.min(topN, len);
+          @Override
+          protected long getValue(int i) {
+            return values[i];
+          }
 
-    for (int i = 0; i < len; i++) {
-      accuWeight += weights[i];
-      valueSum += values[i];
-      count++;
+          @Override
+          protected void swap(int index1, int index2) {
+            long tmp = values[index1];
+            values[index1] = values[index2];
+            values[index2] = tmp;
+            tmp = weights[index1];
+            weights[index1] = weights[index2];
+            weights[index2] = tmp;
+          }
 
-      if (accuWeight >= rangeWeightTarget) {
+          @Override
+          protected void setPivot(int i) {
+            pivotValue = values[i];
+            pivotWeight = weights[i];
+          }
+
+          @Override
+          protected int comparePivot(int j) {
+            int cmp = Long.compare(pivotValue, values[j]);
+            if (cmp == 0) {
+              // If the values are equal, sort based on the weights.
+              // Any weight order is correct as long as it's deterministic.
+              return Long.compare(pivotWeight, weights[j]);
+            }
+            return cmp;
+          }
+        }.select(0, len, totalValue, 0, totalWeight, 0, kWeights);
+
+    int lastIdx = -1;
+    long lastTotalValue = 0;
+    long lastTotalWeight = 0;
+    for (int kIdx = 0; kIdx < topN; kIdx++) {

Review Comment:
   Yeah, so we should be finding either the `min` or the `max` here, given the 
algorithm needs to "select" given points and those points have to be one or the 
other. Your test tells me that this is the `max`, since the `min` values are 
incorrect.
   
   Another way of doing this is to also "select" the `min`, so modify the 
algorithm to require two matches instead of one for each bucket... I'd have to 
take some time to see what this would take, but I think it should definitely be 
faster.
   
   Note, given this change, we would be iterating through each bucket, this 
would just add `O(n)` time (O(n<sub>1</sub>) + O(n<sub>2</sub>) + ...), to the 
already ~`O(n)` quick-select algorithm. Still quicker than the `O(nlogn)` sort 
option. But I'd like to see if we could bake this into the algorithm in a 
smarter way



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

Reply via email to