mikemccand commented on code in PR #12415:
URL: https://github.com/apache/lucene/pull/12415#discussion_r1276073917
##########
lucene/core/src/java/org/apache/lucene/search/BooleanScorer.java:
##########
@@ -154,19 +207,20 @@ public void collect(int doc) throws IOException {
       throw new IllegalArgumentException(
           "This scorer can only be used with two scorers or more, got " + scorers.size());
     }
-    for (int i = 0; i < buckets.length; i++) {
-      buckets[i] = new Bucket();
+    if (needsScores || minShouldMatch > 1) {

Review Comment:
   Might this also optimize other cases, where we are using `BooleanScorer` in non-scoring cases (`MUST_NOT` or `FILTER`)? Or do we never use `BooleanScorer` for these clauses and it's really just the `count` API that we are accelerating here?

##########
lucene/core/src/java/org/apache/lucene/search/BooleanScorer.java:
##########
@@ -109,7 +109,9 @@ public BulkScorerAndDoc get(int i) {
     }
   }
 
-  final Bucket[] buckets = new Bucket[SIZE];
+  // One bucket per doc ID in the window, non-null if scores are needed or if frequencies need to be
+  // counted
+  final Bucket[] buckets;

Review Comment:
   I wonder if switching this to parallel arrays, for maybe better CPU cache locality, would show any speedup (separate issue!). Or maybe "structs" (value objects) when Java finally gets them. Though, the inlined `matching` bitset is sort of already a parallel array and maybe gets most of the gains.

##########
lucene/core/src/java/org/apache/lucene/search/DocIdStream.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+
+/**
+ * A stream of doc IDs. Most methods on {@link DocIdStream}s are terminal, meaning that the {@link
+ * DocIdStream} may not be further used.
+ *
+ * @see LeafCollector#collect(DocIdStream)
+ * @lucene.experimental
+ */
+public abstract class DocIdStream {

Review Comment:
   I'm not sure where to document this, but this stream is not in general (though could be) holding ALL matching hits for a given collection situation (query) right? As used from `BooleanScorer` it is just one window's worth of hits (a 2048 chunk of docid space) at once? I guess the right place to make this clear is in the new `collect(DocIdStream)` method?
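   For readers following along, here is a rough sketch of the window-at-a-time shape being described: a stream that only ever covers one 2048-doc slice of the doc ID space. The names (`WindowDocIdStreamSketch`, `bits`, `base`, `WINDOW_SIZE`) are invented for illustration and are not the PR's actual `DocIdStreamView`.

```java
import java.io.IOException;

// Illustrative sketch only: a DocIdStream-like view over a single 2048-doc window.
final class WindowDocIdStreamSketch {

  interface CheckedIntConsumerSketch {
    void accept(int doc) throws IOException;
  }

  static final int WINDOW_SIZE = 2048;

  final long[] bits = new long[WINDOW_SIZE / Long.SIZE]; // one bit per doc in the window
  int base; // doc ID of the first slot of the current window

  // Visit every matching doc in this window only, in increasing doc ID order.
  void forEach(CheckedIntConsumerSketch consumer) throws IOException {
    for (int idx = 0; idx < bits.length; idx++) {
      long word = bits[idx];
      while (word != 0L) {
        int ntz = Long.numberOfTrailingZeros(word);
        consumer.accept(base + (idx << 6) + ntz);
        word ^= 1L << ntz; // clear the bit we just visited
      }
    }
  }
}
```

   The point of the review comment is that such a stream holds the matches of one window, not of the whole query, which is why the window/lifetime semantics would need to be spelled out somewhere in the javadoc.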
##########
lucene/core/src/java/org/apache/lucene/search/BooleanScorer.java:
##########
@@ -154,19 +207,20 @@ public void collect(int doc) throws IOException {
       throw new IllegalArgumentException(
           "This scorer can only be used with two scorers or more, got " + scorers.size());
     }
-    for (int i = 0; i < buckets.length; i++) {
-      buckets[i] = new Bucket();
+    if (needsScores || minShouldMatch > 1) {
+      buckets = new Bucket[SIZE];
+      for (int i = 0; i < buckets.length; i++) {
+        buckets[i] = new Bucket();
+      }
+    } else {
+      buckets = null;
     }
     this.leads = new BulkScorerAndDoc[scorers.size()];
     this.head = new HeadPriorityQueue(scorers.size() - minShouldMatch + 1);
     this.tail = new TailPriorityQueue(minShouldMatch - 1);
     this.minShouldMatch = minShouldMatch;
+    this.needsScores = needsScores;
     for (BulkScorer scorer : scorers) {
-      if (needsScores == false) {
-        // OrCollector calls score() all the time so we have to explicitly
-        // disable scoring in order to avoid decoding useless norms
-        scorer = BooleanWeight.disableScoring(scorer);

Review Comment:
   Nice -- this change is a more effective way to disable scoring than wrapping in a no-op / fake scorer!

##########
lucene/core/src/java/org/apache/lucene/search/CheckedIntConsumer.java:
##########
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.function.IntConsumer;
+
+/** Like {@link IntConsumer}, but may throw checked exceptions. */
+@FunctionalInterface
+public interface CheckedIntConsumer<T extends Exception> {

Review Comment:
   Darned ubiquitous `IOException` all throughout Lucene!!

##########
lucene/core/src/java/org/apache/lucene/search/LeafCollector.java:
##########
@@ -83,6 +83,13 @@ public interface LeafCollector {
    */
   void collect(int doc) throws IOException;
 
+  /**
+   * Bulk-collect doc IDs. The default implementation calls {@code stream.forEach(this::collect)}.

Review Comment:
   Can we note that this might be a chunk/window of docids, and that it's always sequential/in-order with respect to other calls to collect (e.g. `collect(int doc)`)? Is it valid for a caller to mix & match calls to both `collect` methods here? I would think so, but we are not yet doing that since this change will always collect with one or the other.
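   To make the default-method behavior being discussed concrete, here is a simplified sketch of the pattern the new javadoc describes. The `*Sketch` types below are invented stand-ins for Lucene's `LeafCollector`, `DocIdStream`, and `CheckedIntConsumer`, not the real API.

```java
import java.io.IOException;

// Simplified stand-ins for Lucene's CheckedIntConsumer / DocIdStream (not the real API).
interface CheckedIntConsumerSketch {
  void accept(int value) throws IOException;
}

abstract class DocIdStreamSketch {
  public abstract void forEach(CheckedIntConsumerSketch consumer) throws IOException;
}

// Sketch of the pattern the new javadoc describes: the default bulk method forwards every
// doc in the stream to the per-doc collect(int), so existing collectors keep working
// unchanged, while specialized collectors can override the bulk method for a faster path.
interface LeafCollectorSketch {

  void collect(int doc) throws IOException;

  default void collect(DocIdStreamSketch stream) throws IOException {
    stream.forEach(this::collect);
  }
}
```

   Because the default simply delegates, mixing calls to both `collect` variants would still deliver each doc exactly once to collectors that only implement the per-doc method, which is why the ordering/mixing contract the comment asks about seems worth documenting.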
##########
lucene/core/src/java/org/apache/lucene/search/BooleanScorer.java:
##########
@@ -133,14 +136,64 @@ public void collect(int doc) throws IOException {
       final int i = doc & MASK;
       final int idx = i >>> 6;
       matching[idx] |= 1L << i;
-      final Bucket bucket = buckets[i];
-      bucket.freq++;
-      bucket.score += scorer.score();
+      if (buckets != null) {
+        final Bucket bucket = buckets[i];
+        bucket.freq++;
+        if (needsScores) {
+          bucket.score += scorer.score();
+        }
+      }
     }
   }
 
   final OrCollector orCollector = new OrCollector();
 
+  final class DocIdStreamView extends DocIdStream {
+
+    int base;
+
+    @Override
+    public void forEach(CheckedIntConsumer<IOException> consumer) throws IOException {
+      long[] matching = BooleanScorer.this.matching;
+      Bucket[] buckets = BooleanScorer.this.buckets;

Review Comment:
   We should (later!) maybe rename `Bucket` to `OneHit` or `DocHit` or so, to make it clear it represents details of a single doc hit.

##########
lucene/test-framework/src/java/org/apache/lucene/tests/search/QueryUtils.java:
##########
@@ -738,4 +740,66 @@ public void collect(int doc) throws IOException {
       }
     }
   }
+
+  /**
+   * Check that counting hits through {@link DocIdStream#count()} yield the same result as counting
+   * naively.
+   */
+  public static void checkCount(Query query, final IndexSearcher searcher) throws IOException {
+    query = searcher.rewrite(query);
+    Weight weight = searcher.createWeight(query, ScoreMode.COMPLETE_NO_SCORES, 1);
+    for (LeafReaderContext context : searcher.getIndexReader().leaves()) {
+      BulkScorer scorer = weight.bulkScorer(context);
+      if (scorer == null) {
+        continue;
+      }
+      int[] expectedCount = {0};
+      boolean[] docIdStream = {false};
+      scorer.score(
+          new LeafCollector() {
+            @Override
+            public void collect(DocIdStream stream) throws IOException {
+              docIdStream[0] = true;
+              LeafCollector.super.collect(stream);

Review Comment:
   This then forwards to our `collect(int doc)` method below, right? So we are forcing counting "the slow way" (one by one).

##########
lucene/test-framework/src/java/org/apache/lucene/tests/search/QueryUtils.java:
##########
@@ -738,4 +740,66 @@ public void collect(int doc) throws IOException {
       }
     }
   }
+
+  /**
+   * Check that counting hits through {@link DocIdStream#count()} yield the same result as counting
+   * naively.
+   */
+  public static void checkCount(Query query, final IndexSearcher searcher) throws IOException {
+    query = searcher.rewrite(query);
+    Weight weight = searcher.createWeight(query, ScoreMode.COMPLETE_NO_SCORES, 1);
+    for (LeafReaderContext context : searcher.getIndexReader().leaves()) {
+      BulkScorer scorer = weight.bulkScorer(context);
+      if (scorer == null) {
+        continue;
+      }
+      int[] expectedCount = {0};
+      boolean[] docIdStream = {false};
+      scorer.score(
+          new LeafCollector() {
+            @Override
+            public void collect(DocIdStream stream) throws IOException {
+              docIdStream[0] = true;
+              LeafCollector.super.collect(stream);
+            }
+
+            @Override
+            public void collect(int doc) throws IOException {
+              expectedCount[0]++;
+            }
+
+            @Override
+            public void setScorer(Scorable scorer) throws IOException {}
+          },
+          context.reader().getLiveDocs(),
+          0,
+          DocIdSetIterator.NO_MORE_DOCS);
+      if (docIdStream[0] == false) {
+        // Don't spend cycles running the query one more time, it doesn't use the DocIdStream
+        // optimization.
+        continue;
+      }
+      scorer = weight.bulkScorer(context);
+      int[] actualCount = {0};
+      scorer.score(
+          new LeafCollector() {
+            @Override
+            public void collect(DocIdStream stream) throws IOException {
+              actualCount[0] += stream.count();
+            }
+
+            @Override
+            public void collect(int doc) throws IOException {
+              actualCount[0]++;
+            }
+
+            @Override
+            public void setScorer(Scorable scorer) throws IOException {}
+          },
+          context.reader().getLiveDocs(),
+          0,
+          DocIdSetIterator.NO_MORE_DOCS);
+      assertEquals(expectedCount[0], actualCount[0]);

Review Comment:
   Nice! So we count both the fast way (`DocIdStream#count`) and the slow way (one by one) and confirm they agree.

##########
lucene/core/src/java/org/apache/lucene/search/BooleanScorer.java:
##########
@@ -119,6 +121,7 @@ public BulkScorerAndDoc get(int i) {
   final Score score = new Score();
   final int minShouldMatch;
   final long cost;
+  final boolean needsScores;
 
   final class OrCollector implements LeafCollector {

Review Comment:
   Do we really only use this `BooleanScorer` for pure disjunctive cases now? I wonder if it might be faster than BS2 for certain conjunctive cases, e.g. if the clauses all have "similar" cost. (Separate issue.)
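   Coming back to the fast-vs.-slow counting that `checkCount` verifies above, here is a minimal sketch of the contrast, assuming a window's matches are tracked in a `long[]` bitset with one bit per candidate doc; the class and method names are invented for illustration and are not Lucene's actual `DocIdStream` code.

```java
/**
 * Hedged sketch of "slow" per-doc counting vs. "fast" bulk counting over one window's
 * bitset of matches (a long[] with one bit per candidate doc). Names are invented.
 */
final class WindowCountSketch {

  // Slow way: walk every set bit, as a per-doc collect(int) loop effectively does.
  static int countOneByOne(long[] matchBits) {
    int n = 0;
    for (long word : matchBits) {
      while (word != 0L) {
        n++; // one "collect" per matching doc
        word ^= Long.lowestOneBit(word);
      }
    }
    return n;
  }

  // Fast way, in the spirit of DocIdStream#count(): one popcount per 64-bit word,
  // with no per-doc callbacks at all.
  static int countByPopcount(long[] matchBits) {
    int n = 0;
    for (long word : matchBits) {
      n += Long.bitCount(word);
    }
    return n;
  }
}
```

   The test above asserts exactly this equivalence: the popcount-style bulk count must match the count obtained by visiting hits one by one.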