[GitHub] [lucene] mikemccand commented on a change in pull request #127: LUCENE-9946: Support multi-value fields in range facet counting

GitBox Tue, 25 May 2021 07:29:56 -0700


mikemccand commented on a change in pull request #127:
URL: https://github.com/apache/lucene/pull/127#discussion_r638843546




##########
File path: lucene/facet/src/java/org/apache/lucene/facet/range/OverlappingLongRangeCounter.java
##########
@@ -0,0 +1,364 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.range;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import org.apache.lucene.util.FixedBitSet;
+
+/**
+ * This implementation supports requested ranges that overlap. Because of this, we use a
+ * segment-tree to more efficiently aggregate counts into ranges at the end of processing. We also
+ * need to worry about double-counting issues since it's possible that multiple elementary segments,
+ * although mutually-exclusive, can roll-up to the same requested range. This creates some
+ * complexity with how we need to handle multi-valued documents.
+ */
+class OverlappingLongRangeCounter extends LongRangeCounter {
+
+  /** segment tree root node */
+  private final LongRangeNode root;
+  /** elementary segment boundaries used for efficient counting (bsearch to find interval) */
+  private final long[] boundaries;
+  /** whether-or-not there are leaf counts that still need to be rolled up at the end */
+  private boolean hasUnflushedCounts = false;
+
+  // Needed only for counting single-valued docs:
+  /** counts seen in each elementary interval leaf */
+  private int[] singleValuedLeafCounts;
+
+  // Needed only for counting multi-valued docs:
+  /** whether-or-not an elementary interval has seen at least one match for a single doc */
+  private FixedBitSet multiValuedDocLeafHits;
+  /** whether-or-not a range has seen at least one match for a single doc */
+  private FixedBitSet multiValuedDocRangeHits;
+
+  // Used during rollup
+  private int leafUpto;
+  /** number of counted documents that haven't matched any requested ranges */
+  private int missingCount = 0;
+
+  OverlappingLongRangeCounter(LongRange[] ranges, int[] countBuffer) {
+    super(countBuffer);
+
+    // Build elementary intervals:
+    List<InclusiveRange> elementaryIntervals = buildElementaryIntervals(ranges);
+
+    // Build binary tree on top of intervals:
+    root = split(0, elementaryIntervals.size(), elementaryIntervals);
+
+    // Set outputs, so we know which range to output for each node in the tree:
+    for (int i = 0; i < ranges.length; i++) {
+      root.addOutputs(i, ranges[i]);
+    }
+
+    // Keep track of elementary interval max boundaries for bsearch:
+    boundaries = new long[elementaryIntervals.size()];
+    for (int i = 0; i < boundaries.length; i++) {
+      boundaries[i] = elementaryIntervals.get(i).end;
+    }
+  }
+
+  @Override
+  void startMultiValuedDoc() {
+    super.startMultiValuedDoc();
+    // Lazy init a bitset to track the elementary segments we see of a multi-valued doc:
+    if (multiValuedDocLeafHits == null) {
+      multiValuedDocLeafHits = new FixedBitSet(boundaries.length);
+    } else {
+      multiValuedDocLeafHits.clear(0, multiValuedDocLeafHits.length());
+    }
+  }
+
+  @Override
+  boolean endMultiValuedDoc() {
+    assert multiValuedDocLeafHits != null : "must call startMultiValuedDoc() first";
+
+    // Short-circuit if the caller didn't specify any ranges to count:
+    if (rangeCount() == 0) {
+      return false;
+    }
+
+    // Do the rollup for this doc:
+    // Lazy init a bitset to track the requested ranges seen for this multi-valued doc:
+    if (multiValuedDocRangeHits == null) {
+      multiValuedDocRangeHits = new FixedBitSet(rangeCount());
+    } else {
+      multiValuedDocRangeHits.clear(0, multiValuedDocRangeHits.length());
+    }
+    leafUpto = 0;
+    rollupMultiValued(root);
+
+    // Actually increment the count for each matching range, and see if the doc contributed to
+    // at least one:
+    boolean docContributedToAtLeastOneRange = false;
+    for (int i = multiValuedDocRangeHits.nextSetBit(0); i < multiValuedDocRangeHits.length(); ) {
+      increment(i);
+      docContributedToAtLeastOneRange = true;
+      if (++i < multiValuedDocRangeHits.length()) {
+        i = multiValuedDocRangeHits.nextSetBit(i);
+      }
+    }
+
+    return docContributedToAtLeastOneRange;
+  }
+
+  @Override
+  int finish() {
+    if (hasUnflushedCounts) {
+      // Rollup any outstanding counts from single-valued cases:
+      missingCount = 0;
+      leafUpto = 0;
+      rollupSingleValued(root, false);
+
+      return missingCount;
+    } else {
+      return 0;
+    }
+  }
+
+  @Override
+  protected long[] boundaries() {
+    return boundaries;
+  }
+
+  @Override
+  protected void processSingleValuedHit(int elementarySegmentNum) {
+    // Lazy init:
+    if (singleValuedLeafCounts == null) {

Review comment:
       Ahh, OK, that's interesting.  So let's leave out those `assert`s I
suggested, and maybe in the future we could mix and match this kind of
optimization.  `SortedNumericDocValues` has a `.docValueCount()` method that
tells you, per hit, how many values there are ... I guess we could then
optimize the `.docValueCount() == 1` case.
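
A rough sketch of what that mix/match might look like, under stated assumptions: it is not the code in this PR, and `RangeCountSketch`, its `DocValues` interface, and `docOf` are hypothetical stand-ins rather than Lucene classes. It skips the segment tree entirely and scans the ranges linearly, so it only illustrates the dispatch on the per-document value count: a single value can increment counts directly, while several values first go through a bitset so that an overlapping range is counted once per document.

```java
import java.util.BitSet;

/** Illustrative only: these names are hypothetical and are not Lucene classes. */
final class RangeCountSketch {

  /** Stand-in for a per-document value source (assumed shape, not the Lucene API). */
  interface DocValues {
    int docValueCount(); // how many values the current document has

    long nextValue(); // next value of the current document
  }

  private final long[] mins; // inclusive lower bounds, one per requested range
  private final long[] maxes; // inclusive upper bounds, one per requested range
  final int[] counts; // per-range document counts
  private final BitSet rangeHits; // ranges already matched by the current document

  RangeCountSketch(long[] mins, long[] maxes) {
    this.mins = mins;
    this.maxes = maxes;
    this.counts = new int[mins.length];
    this.rangeHits = new BitSet(mins.length);
  }

  /** Counts one document and reports whether it matched at least one range. */
  boolean countDoc(DocValues values) {
    int valueCount = values.docValueCount();
    if (valueCount == 1) {
      // Fast path: one value hits each range at most once, so no dedupe is needed.
      long v = values.nextValue();
      boolean matched = false;
      for (int r = 0; r < mins.length; r++) {
        if (v >= mins[r] && v <= maxes[r]) {
          counts[r]++;
          matched = true;
        }
      }
      return matched;
    }
    // General path: record each matched range once, then increment per set bit.
    rangeHits.clear();
    for (int i = 0; i < valueCount; i++) {
      long v = values.nextValue();
      for (int r = 0; r < mins.length; r++) {
        if (v >= mins[r] && v <= maxes[r]) {
          rangeHits.set(r);
        }
      }
    }
    for (int r = rangeHits.nextSetBit(0); r >= 0; r = rangeHits.nextSetBit(r + 1)) {
      counts[r]++;
    }
    return !rangeHits.isEmpty();
  }

  /** Wraps a fixed list of values as a single document, for demonstration. */
  static DocValues docOf(long... vals) {
    return new DocValues() {
      private int next = 0;

      @Override
      public int docValueCount() {
        return vals.length;
      }

      @Override
      public long nextValue() {
        return vals[next++];
      }
    };
  }
}
```

With overlapping ranges such as [0, 10] and [5, 15], a document holding the values 1 and 2 should bump the first range only once, which is the double-counting concern the class javadoc describes.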




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
