gsmiller commented on a change in pull request #127: URL: https://github.com/apache/lucene/pull/127#discussion_r626951160
########## File path: lucene/facet/src/java/org/apache/lucene/facet/range/LongRangeCounter.java ########## @@ -17,26 +17,153 @@ package org.apache.lucene.facet.range; import java.util.ArrayList; +import java.util.Arrays; import java.util.Collections; import java.util.HashMap; import java.util.List; import java.util.Map; /** - * Counts how many times each range was seen; per-hit it's just a binary search ({@link #add}) - * against the elementary intervals, and in the end we rollup back to the original ranges. + * Segment tree for counting numeric ranges. Works for both single- and multi-valued cases (assuming + * you use it correctly). + * + * <p>Usage notes: For counting against a single value field/source, callers should call add() for + * each value and then call finish() after all documents have been processed. The call to finish() + * will inform the caller how many documents didn't match against any ranges. After finish() has + * been called, the caller-provided count buffer (passed into the ctor) will be populated with + * accurate range counts. + * + * <p>For counting against a multi-valued field, callers should call startDoc() at the beginning of + * processing each doc, followed by add() for each value, and then endDoc() at the end of the doc. + * The call to endDoc() will inform the caller if that document matched against any ranges. It is + * not necessary to call finish() for multi-valued cases. After each call to endDoc(), the + * caller-provided count buffer (passed into the ctor) will be populated with accurate range counts. */ final class LongRangeCounter { - final LongRangeNode root; - final long[] boundaries; - final int[] leafCounts; + /** segment tree root node */ + private final LongRangeNode root; + /** elementary segment boundaries used for efficient counting (bsearch to find interval) */ + private final long[] boundaries; + /** accumulated counts for all of the ranges */ + private final int[] countBuffer; + /** whether-or-not we're counting docs that could be multi-valued */ + final boolean isMultiValued; + + // Needed only for counting single-valued docs: + /** counts seen in each elementary interval leaf */ + private final int[] singleValuedLeafCounts; + + // Needed only for counting multi-valued docs: + /** whether-or-not an elementary interval has seen at least one match for a single doc */ + private final boolean[] multiValuedDocLeafHits; + /** whether-or-not a range has seen at least one match for a single doc */ + private final boolean[] multiValuedDocRangeHits; // Used during rollup private int leafUpto; private int missingCount; - public LongRangeCounter(LongRange[] ranges) { + LongRangeCounter(LongRange[] ranges, int[] countBuffer, boolean isMultiValued) { + // Whether-or-not we're processing docs that could be multi-valued: + this.isMultiValued = isMultiValued; + + // We'll populate the user-provided count buffer with range counts: + this.countBuffer = countBuffer; + + // Build elementary intervals: + List<InclusiveRange> elementaryIntervals = buildElementaryIntervals(ranges); + + // Build binary tree on top of intervals: + root = split(0, elementaryIntervals.size(), elementaryIntervals); + + // Set outputs, so we know which range to output for each node in the tree: + for (int i = 0; i < ranges.length; i++) { + root.addOutputs(i, ranges[i]); + } + + // Set boundaries (ends of each elementary interval): + boundaries = new long[elementaryIntervals.size()]; + for (int i = 0; i < boundaries.length; i++) { + boundaries[i] = elementaryIntervals.get(i).end; + } + + // Setup to count: + if (isMultiValued == false) { + // Setup to count single-valued docs only: + singleValuedLeafCounts = new int[boundaries.length]; + multiValuedDocLeafHits = null; + multiValuedDocRangeHits = null; + } else { + // Setup to count multi-valued docs: + singleValuedLeafCounts = null; + multiValuedDocLeafHits = new boolean[boundaries.length]; + multiValuedDocRangeHits = new boolean[ranges.length]; + } + } + + /** + * Start processing a new doc. It's unnecessary to call this for single-value cases (but it + * doesn't cause problems if you do). + */ + void startDoc() { + if (isMultiValued) { + Arrays.fill(multiValuedDocLeafHits, false); + } + } + + /** + * Finish processing a new doc. Returns whether-or-not the document contributed a count to at + * least one range. It's unnecessary to call this for single-value cases, and the return value in + * such cases will always be {@code true} (but calling it doesn't cause any problems). + */ + boolean endDoc() { + // Necessary to rollup after each doc for multi-valued case: + if (isMultiValued) { + leafUpto = 0; + Arrays.fill(multiValuedDocRangeHits, false); + rollupMultiValued(root); + boolean docContributedToAtLeastOneRange = false; + for (int i = 0; i < multiValuedDocRangeHits.length; i++) { Review comment: Interesting. I'm not sure I can speak with any authority on "typical" use cases, but it does seem like the range counting implementation in general is optimized to scale to a large number of ranges (e.g., paying the segment-tree setup costs to reduce runtime-complexity of matching ranges), so I lean towards doing the same. As for sparseness, I suspect that _will_ be a typical case. It's not unreasonable to expect any given (multi-valued) field would only match a small number of the ranges. Anyway, I like the `BitSet` suggestion and will go ahead and update this change to use it. It'd be really interesting to run a micro-bechmark to see the impact. I'll see if I can carve out some time for that in the next couple days. We could also explore that in a follow-up Jira if you don't think that's blocking. I'd be curious to dig into that either way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org