[GitHub] [lucene] gsmiller commented on a change in pull request #127: LUCENE-9946: Support multi-value fields in range facet counting

GitBox Wed, 05 May 2021 15:06:32 -0700


gsmiller commented on a change in pull request #127:
URL: https://github.com/apache/lucene/pull/127#discussion_r626951160




##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/range/LongRangeCounter.java
##########
@@ -17,26 +17,153 @@
 package org.apache.lucene.facet.range;
 
 import java.util.ArrayList;
+import java.util.Arrays;
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
 
 /**
- * Counts how many times each range was seen; per-hit it's just a binary 
search ({@link #add})
- * against the elementary intervals, and in the end we rollup back to the 
original ranges.
+ * Segment tree for counting numeric ranges. Works for both single- and 
multi-valued cases (assuming
+ * you use it correctly).
+ *
+ * <p>Usage notes: For counting against a single value field/source, callers 
should call add() for
+ * each value and then call finish() after all documents have been processed. 
The call to finish()
+ * will inform the caller how many documents didn't match against any ranges. 
After finish() has
+ * been called, the caller-provided count buffer (passed into the ctor) will 
be populated with
+ * accurate range counts.
+ *
+ * <p>For counting against a multi-valued field, callers should call 
startDoc() at the beginning of
+ * processing each doc, followed by add() for each value, and then endDoc() at 
the end of the doc.
+ * The call to endDoc() will inform the caller if that document matched 
against any ranges. It is
+ * not necessary to call finish() for multi-valued cases. After each call to 
endDoc(), the
+ * caller-provided count buffer (passed into the ctor) will be populated with 
accurate range counts.
  */
 final class LongRangeCounter {
 
-  final LongRangeNode root;
-  final long[] boundaries;
-  final int[] leafCounts;
+  /** segment tree root node */
+  private final LongRangeNode root;
+  /** elementary segment boundaries used for efficient counting (bsearch to 
find interval) */
+  private final long[] boundaries;
+  /** accumulated counts for all of the ranges */
+  private final int[] countBuffer;
+  /** whether-or-not we're counting docs that could be multi-valued */
+  final boolean isMultiValued;
+
+  // Needed only for counting single-valued docs:
+  /** counts seen in each elementary interval leaf */
+  private final int[] singleValuedLeafCounts;
+
+  // Needed only for counting multi-valued docs:
+  /** whether-or-not an elementary interval has seen at least one match for a 
single doc */
+  private final boolean[] multiValuedDocLeafHits;
+  /** whether-or-not a range has seen at least one match for a single doc */
+  private final boolean[] multiValuedDocRangeHits;
 
   // Used during rollup
   private int leafUpto;
   private int missingCount;
 
-  public LongRangeCounter(LongRange[] ranges) {
+  LongRangeCounter(LongRange[] ranges, int[] countBuffer, boolean 
isMultiValued) {
+    // Whether-or-not we're processing docs that could be multi-valued:
+    this.isMultiValued = isMultiValued;
+
+    // We'll populate the user-provided count buffer with range counts:
+    this.countBuffer = countBuffer;
+
+    // Build elementary intervals:
+    List<InclusiveRange> elementaryIntervals = 
buildElementaryIntervals(ranges);
+
+    // Build binary tree on top of intervals:
+    root = split(0, elementaryIntervals.size(), elementaryIntervals);
+
+    // Set outputs, so we know which range to output for each node in the tree:
+    for (int i = 0; i < ranges.length; i++) {
+      root.addOutputs(i, ranges[i]);
+    }
+
+    // Set boundaries (ends of each elementary interval):
+    boundaries = new long[elementaryIntervals.size()];
+    for (int i = 0; i < boundaries.length; i++) {
+      boundaries[i] = elementaryIntervals.get(i).end;
+    }
+
+    // Setup to count:
+    if (isMultiValued == false) {
+      // Setup to count single-valued docs only:
+      singleValuedLeafCounts = new int[boundaries.length];
+      multiValuedDocLeafHits = null;
+      multiValuedDocRangeHits = null;
+    } else {
+      // Setup to count multi-valued docs:
+      singleValuedLeafCounts = null;
+      multiValuedDocLeafHits = new boolean[boundaries.length];
+      multiValuedDocRangeHits = new boolean[ranges.length];
+    }
+  }
+
+  /**
+   * Start processing a new doc. It's unnecessary to call this for 
single-value cases (but it
+   * doesn't cause problems if you do).
+   */
+  void startDoc() {
+    if (isMultiValued) {
+      Arrays.fill(multiValuedDocLeafHits, false);
+    }
+  }
+
+  /**
+   * Finish processing a new doc. Returns whether-or-not the document 
contributed a count to at
+   * least one range. It's unnecessary to call this for single-value cases, 
and the return value in
+   * such cases will always be {@code true} (but calling it doesn't cause any 
problems).
+   */
+  boolean endDoc() {
+    // Necessary to rollup after each doc for multi-valued case:
+    if (isMultiValued) {
+      leafUpto = 0;
+      Arrays.fill(multiValuedDocRangeHits, false);
+      rollupMultiValued(root);
+      boolean docContributedToAtLeastOneRange = false;
+      for (int i = 0; i < multiValuedDocRangeHits.length; i++) {

Review comment:
       Interesting. I'm not sure I can speak with any authority on "typical" 
use cases, but it does seem like the range counting implementation in general 
is optimized to scale to a large number of ranges (e.g., paying the 
segment-tree setup costs to reduce runtime-complexity of matching ranges), so I 
lean towards doing the same. As for sparseness, I suspect that _will_ be a 
typical case. It's not unreasonable to expect any given (multi-valued) field 
would only match a small number of the ranges.
   
   Anyway, I like the `BitSet` suggestion and will go ahead and update this 
change to use it. It'd be really interesting to run a micro-bechmark to see the 
impact. I'll see if I can carve out some time for that in the next couple days. 
We could also explore that in a follow-up Jira if you don't think that's 
blocking. I'd be curious to dig into that either way.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller commented on a change in pull request #127: LUCENE-9946: Support multi-value fields in range facet counting

Reply via email to