mikemccand commented on code in PR #12415:
URL: https://github.com/apache/lucene/pull/12415#discussion_r1276073917
##########
lucene/core/src/java/org/apache/lucene/search/BooleanScorer.java:
##########
@@ -154,19 +207,20 @@ public void collect(int doc) throws IOException {
       throw new IllegalArgumentException(
           "This scorer can only be used with two scorers or more, got " + scorers.size());
     }
-    for (int i = 0; i < buckets.length; i++) {
-      buckets[i] = new Bucket();
+    if (needsScores || minShouldMatch > 1) {

Review Comment:
   Might this also optimize other cases, where we are using `BooleanScorer` in non-scoring cases (`MUST_NOT` or `FILTER`)? Or do we never use `BooleanScorer` for these clauses and it's really just the `count` API that we are accelerating here?

##########
lucene/core/src/java/org/apache/lucene/search/BooleanScorer.java:
##########
@@ -109,7 +109,9 @@ public BulkScorerAndDoc get(int i) {
     }
   }
 
-  final Bucket[] buckets = new Bucket[SIZE];
+  // One bucket per doc ID in the window, non-null if scores are needed or if frequencies need to be
+  // counted
+  final Bucket[] buckets;

Review Comment:
   I wonder if switching this to parallel arrays, for maybe better CPU cache locality, would show any speedup (separate issue!). Or maybe "structs" (value objects) when Java finally gets them. Though, the inlined `matching` bitset is sort of already a parallel array and maybe gets most of the gains.

##########
lucene/core/src/java/org/apache/lucene/search/DocIdStream.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+
+/**
+ * A stream of doc IDs. Most methods on {@link DocIdStream}s are terminal, meaning that the {@link
+ * DocIdStream} may not be further used.
+ *
+ * @see LeafCollector#collect(DocIdStream)
+ * @lucene.experimental
+ */
+public abstract class DocIdStream {

Review Comment:
   I'm not sure where to document this, but this stream is not in general (though could be) holding ALL matching hits for a given collection situation (query) right? As used from `BooleanScorer` it is just one window's worth of hits (a 2048 chunk of docid space) at once? I guess the right place to make this clear is in the new `collect(DocIdStream)` method?
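   For readers following along, here is a rough sketch of the window-at-a-time shape being described: a stream that only ever covers one 2048-doc slice of the doc ID space. The names (`WindowDocIdStreamSketch`, `bits`, `base`, `WINDOW_SIZE`) are invented for illustration and are not the PR's actual `DocIdStreamView`.

```java
import java.io.IOException;

// Illustrative sketch only: a DocIdStream-like view over a single 2048-doc window.
final class WindowDocIdStreamSketch {

  interface CheckedIntConsumerSketch {
    void accept(int doc) throws IOException;
  }

  static final int WINDOW_SIZE = 2048;

  final long[] bits = new long[WINDOW_SIZE / Long.SIZE]; // one bit per doc in the window
  int base; // doc ID of the first slot of the current window

  // Visit every matching doc in this window only, in increasing doc ID order.
  void forEach(CheckedIntConsumerSketch consumer) throws IOException {
    for (int idx = 0; idx < bits.length; idx++) {
      long word = bits[idx];
      while (word != 0L) {
        int ntz = Long.numberOfTrailingZeros(word);
        consumer.accept(base + (idx << 6) + ntz);
        word ^= 1L << ntz; // clear the bit we just visited
      }
    }
  }
}
```

   The point of the review comment is that such a stream holds the matches of one window, not of the whole query, which is why the window/lifetime semantics would need to be spelled out somewhere in the javadoc.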
##########
lucene/core/src/java/org/apache/lucene/search/BooleanScorer.java:
##########
@@ -154,19 +207,20 @@ public void collect(int doc) throws IOException {
       throw new IllegalArgumentException(
           "This scorer can only be used with two scorers or more, got " + scorers.size());
     }
-    for (int i = 0; i < buckets.length; i++) {
-      buckets[i] = new Bucket();
+    if (needsScores || minShouldMatch > 1) {
+      buckets = new Bucket[SIZE];
+      for (int i = 0; i < buckets.length; i++) {
+        buckets[i] = new Bucket();
+      }
+    } else {
+      buckets = null;
     }
     this.leads = new BulkScorerAndDoc[scorers.size()];
     this.head = new HeadPriorityQueue(scorers.size() - minShouldMatch + 1);
     this.tail = new TailPriorityQueue(minShouldMatch - 1);
     this.minShouldMatch = minShouldMatch;
+    this.needsScores = needsScores;
     for (BulkScorer scorer : scorers) {
-      if (needsScores == false) {
-        // OrCollector calls score() all the time so we have to explicitly
-        // disable scoring in order to avoid decoding useless norms
-        scorer = BooleanWeight.disableScoring(scorer);

Review Comment:
   Nice -- this change is a more effective way to disable scoring than wrapping in a no-op / fake scorer!

##########
lucene/core/src/java/org/apache/lucene/search/CheckedIntConsumer.java:
##########
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.function.IntConsumer;
+
+/** Like {@link IntConsumer}, but may throw checked exceptions. */
+@FunctionalInterface
+public interface CheckedIntConsumer<T extends Exception> {

Review Comment:
   Darned ubiquitous `IOException` all throughout Lucene!!

##########
lucene/core/src/java/org/apache/lucene/search/LeafCollector.java:
##########
@@ -83,6 +83,13 @@ public interface LeafCollector {
    */
   void collect(int doc) throws IOException;
 
+  /**
+   * Bulk-collect doc IDs. The default implementation calls {@code stream.forEach(this::collect)}.

Review Comment:
   Can we note that this might be a chunk/window of docids, and that it's always sequential/in-order with respect to other calls to collect (e.g. `collect(int doc)`)? Is it valid for a caller to mix & match calls to both `collect` methods here? I would think so, but we are not yet doing that since this change will always collect with one or the other.
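   To make the default-method behavior being discussed concrete, here is a simplified sketch of the pattern the new javadoc describes. The `*Sketch` types below are invented stand-ins for Lucene's `LeafCollector`, `DocIdStream`, and `CheckedIntConsumer`, not the real API.

```java
import java.io.IOException;

// Simplified stand-ins for Lucene's CheckedIntConsumer / DocIdStream (not the real API).
interface CheckedIntConsumerSketch {
  void accept(int value) throws IOException;
}

abstract class DocIdStreamSketch {
  public abstract void forEach(CheckedIntConsumerSketch consumer) throws IOException;
}

// Sketch of the pattern the new javadoc describes: the default bulk method forwards every
// doc in the stream to the per-doc collect(int), so existing collectors keep working
// unchanged, while specialized collectors can override the bulk method for a faster path.
interface LeafCollectorSketch {

  void collect(int doc) throws IOException;

  default void collect(DocIdStreamSketch stream) throws IOException {
    stream.forEach(this::collect);
  }
}
```

   Because the default simply delegates, mixing calls to both `collect` variants would still deliver each doc exactly once to collectors that only implement the per-doc method, which is why the ordering/mixing contract the comment asks about seems worth documenting.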
##########
lucene/core/src/java/org/apache/lucene/search/BooleanScorer.java:
##########
@@ -133,14 +136,64 @@ public void collect(int doc) throws IOException {
       final int i = doc & MASK;
       final int idx = i >>> 6;
       matching[idx] |= 1L << i;
-      final Bucket bucket = buckets[i];
-      bucket.freq++;
-      bucket.score += scorer.score();
+      if (buckets != null) {
+        final Bucket bucket = buckets[i];
+        bucket.freq++;
+        if (needsScores) {
+          bucket.score += scorer.score();
+        }
+      }
     }
   }
 
   final OrCollector orCollector = new OrCollector();
 
+  final class DocIdStreamView extends DocIdStream {
+
+    int base;
+
+    @Override
+    public void forEach(CheckedIntConsumer<IOException> consumer) throws IOException {
+      long[] matching = BooleanScorer.this.matching;
+      Bucket[] buckets = BooleanScorer.this.buckets;

Review Comment:
   We should (later!) maybe rename `Bucket` to `OneHit` or `DocHit` or so, to make it clear it represents details of a single doc hit.

##########
lucene/test-framework/src/java/org/apache/lucene/tests/search/QueryUtils.java:
##########
@@ -738,4 +740,66 @@ public void collect(int doc) throws IOException {
       }
     }
   }
+
+  /**
+   * Check that counting hits through {@link DocIdStream#count()} yield the same result as counting
+   * naively.
+   */
+  public static void checkCount(Query query, final IndexSearcher searcher) throws IOException {
+    query = searcher.rewrite(query);
+    Weight weight = searcher.createWeight(query, ScoreMode.COMPLETE_NO_SCORES, 1);
+    for (LeafReaderContext context : searcher.getIndexReader().leaves()) {
+      BulkScorer scorer = weight.bulkScorer(context);
+      if (scorer == null) {
+        continue;
+      }
+      int[] expectedCount = {0};
+      boolean[] docIdStream = {false};
+      scorer.score(
+          new LeafCollector() {
+            @Override
+            public void collect(DocIdStream stream) throws IOException {
+              docIdStream[0] = true;
+              LeafCollector.super.collect(stream);

Review Comment:
   This then forwards to our `collect(int doc)` method below, right? So we are forcing counting "the slow way" (one by one).

##########
lucene/test-framework/src/java/org/apache/lucene/tests/search/QueryUtils.java:
##########
@@ -738,4 +740,66 @@ public void collect(int doc) throws IOException {
       }
     }
   }
+
+  /**
+   * Check that counting hits through {@link DocIdStream#count()} yield the same result as counting
+   * naively.
+   */
+  public static void checkCount(Query query, final IndexSearcher searcher) throws IOException {
+    query = searcher.rewrite(query);
+    Weight weight = searcher.createWeight(query, ScoreMode.COMPLETE_NO_SCORES, 1);
+    for (LeafReaderContext context : searcher.getIndexReader().leaves()) {
+      BulkScorer scorer = weight.bulkScorer(context);
+      if (scorer == null) {
+        continue;
+      }
+      int[] expectedCount = {0};
+      boolean[] docIdStream = {false};
+      scorer.score(
+          new LeafCollector() {
+            @Override
+            public void collect(DocIdStream stream) throws IOException {
+              docIdStream[0] = true;
+              LeafCollector.super.collect(stream);
+            }
+
+            @Override
+            public void collect(int doc) throws IOException {
+              expectedCount[0]++;
+            }
+
+            @Override
+            public void setScorer(Scorable scorer) throws IOException {}
+          },
+          context.reader().getLiveDocs(),
+          0,
+          DocIdSetIterator.NO_MORE_DOCS);
+      if (docIdStream[0] == false) {
+        // Don't spend cycles running the query one more time, it doesn't use the DocIdStream
+        // optimization.
+        continue;
+      }
+      scorer = weight.bulkScorer(context);
+      int[] actualCount = {0};
+      scorer.score(
+          new LeafCollector() {
+            @Override
+            public void collect(DocIdStream stream) throws IOException {
+              actualCount[0] += stream.count();
+            }
+
+            @Override
+            public void collect(int doc) throws IOException {
+              actualCount[0]++;
+            }
+
+            @Override
+            public void setScorer(Scorable scorer) throws IOException {}
+          },
+          context.reader().getLiveDocs(),
+          0,
+          DocIdSetIterator.NO_MORE_DOCS);
+      assertEquals(expectedCount[0], actualCount[0]);

Review Comment:
   Nice! So we count both the fast way (`DocIdStream#count`) and the slow way (one by one) and confirm they agree.

##########
lucene/core/src/java/org/apache/lucene/search/BooleanScorer.java:
##########
@@ -119,6 +121,7 @@ public BulkScorerAndDoc get(int i) {
   final Score score = new Score();
   final int minShouldMatch;
   final long cost;
+  final boolean needsScores;
 
   final class OrCollector implements LeafCollector {

Review Comment:
   Do we really only use this `BooleanScorer` for pure disjunctive cases now? I wonder if it might be faster than BS2 for certain conjunctive cases, e.g. if the clauses all have "similar" cost. (Separate issue.)
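   Coming back to the fast-vs.-slow counting that `checkCount` verifies above, here is a minimal sketch of the contrast, assuming a window's matches are tracked in a `long[]` bitset with one bit per candidate doc; the class and method names are invented for illustration and are not Lucene's actual `DocIdStream` code.

```java
/**
 * Hedged sketch of "slow" per-doc counting vs. "fast" bulk counting over one window's
 * bitset of matches (a long[] with one bit per candidate doc). Names are invented.
 */
final class WindowCountSketch {

  // Slow way: walk every set bit, as a per-doc collect(int) loop effectively does.
  static int countOneByOne(long[] matchBits) {
    int n = 0;
    for (long word : matchBits) {
      while (word != 0L) {
        n++; // one "collect" per matching doc
        word ^= Long.lowestOneBit(word);
      }
    }
    return n;
  }

  // Fast way, in the spirit of DocIdStream#count(): one popcount per 64-bit word,
  // with no per-doc callbacks at all.
  static int countByPopcount(long[] matchBits) {
    int n = 0;
    for (long word : matchBits) {
      n += Long.bitCount(word);
    }
    return n;
  }
}
```

   The test above asserts exactly this equivalence: the popcount-style bulk count must match the count obtained by visiting hits one by one.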