mikemccand commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1750413448
##########
lucene/MIGRATE.md:
##########
@@ -802,5 +802,35 @@ both `TopDocs` as well as facets results included in a reduced `FacetsCollector`
 
 ### `SearchWithCollectorTask` no longer supports the `collector.class` config parameter
 
-`collector.class` used to allow users to load a custom collector implementation. `collector.manager.class`
-replaces it by allowing users to load a custom collector manager instead. (Luca Cavanna)
\ No newline at end of file
+`collector.class` used to allow users to load a custom collector implementation. `collector.manager.class`
+replaces it by allowing users to load a custom collector manager instead. (Luca Cavanna)
+
+### BulkScorer#score(LeafCollector collector, Bits acceptDocs) removed
+
+Use `BulkScorer#score(LeafCollector collector, Bits acceptDocs, int min, int max)` instead. In order to score the
+entire leaf, provide `0` as min and `DocIdSetIterator.NO_MORE_DOCS` as max. `BulkScorer` subclasses that override
+such method need to instead override the method variant that takes the range of doc ids as well as arguments.
+
+### CollectorManager#newCollector and Collector#getLeafCollector contract
+
+With the introduction of intra-segment query concurrency support, a `LeafCollector` for a given `LeafReaderContext`
+instance may be requested multiple times as part of a single search call. This may only happen across separate
+`Collector` instances returned by `CollectorManager#newCollector`, and never for the same `Collector` instance, given
+that a slice can only ever hold a single partition of a given segment. Any logic or computation that needs to happen
+once per segment requires specific handling in the collector manager implementation.
+See `TotalHitCountCollectorManager` as an example.
+
+### Weight#scorer, Weight#bulkScorer and Weight#scorerSupplier contract
+
+With the introduction of intra-segment query concurrency support, a `Scorer`, `ScorerSupplier` or `BulkScorer` may
+be requested multiple times for the same `LeafReaderContext` instance as part of a single search call. That
+may happen concurrently from separate threads each searching a specific doc id range of the segment.
+`Weight` implementations that rely on the assumption that a scorer, bulk scorer or scorer supplier for a given
+`LeafReaderContext` is requested once per search need updating.
+
+### Signature of IndexSearcher#searchLeaf changed
+
+With the introduction of intra-segment query concurrency support, the `IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector)`
+method accepts now two additional int arguments to identify the min and max doc id of the range that the leaf partition

Review Comment:
   `accepts now` -> `now accepts`?



##########
lucene/core/src/java/org/apache/lucene/search/CollectorManager.java:
##########
@@ -31,6 +32,13 @@
  * fully collected.
  * </ul>
  *
+ * <p><strong>Note:</strong> Separate {@link Collector} instances returned by {@link

Review Comment:
   Maybe `Multiple LeafCollectors may be requested for the same LeafReaderContext via getLeafCollector` (with the nice `link`s included)?
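To make the `BulkScorer#score` migration entry in the MIGRATE.md hunk above concrete, here is a minimal sketch of scoring an entire leaf with the remaining ranged variant; the wrapper class and helper method are hypothetical, only the `score(collector, acceptDocs, 0, DocIdSetIterator.NO_MORE_DOCS)` call comes from the documented replacement:

```java
import java.io.IOException;
import org.apache.lucene.search.BulkScorer;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.util.Bits;

class WholeLeafScoringSketch {
  // Hypothetical helper: scores the whole doc id space of a leaf, replacing the
  // removed two-argument BulkScorer#score(LeafCollector, Bits) call.
  static void scoreEntireLeaf(BulkScorer bulkScorer, LeafCollector collector, Bits acceptDocs)
      throws IOException {
    // Before: bulkScorer.score(collector, acceptDocs);
    // After: 0 as the min doc id, DocIdSetIterator.NO_MORE_DOCS as the max doc id.
    bulkScorer.score(collector, acceptDocs, 0, DocIdSetIterator.NO_MORE_DOCS);
  }
}
```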
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -363,7 +374,88 @@ public static LeafSlice[] slices(
     LeafSlice[] slices = new LeafSlice[groupedLeaves.size()];
     int upto = 0;
     for (List<LeafReaderContext> currentLeaf : groupedLeaves) {
-      slices[upto] = new LeafSlice(currentLeaf);
+      slices[upto] =
+          new LeafSlice(
+              new ArrayList<>(
+                  currentLeaf.stream()
+                      .map(LeafReaderContextPartition::createForEntireSegment)
+                      .toList()));
       ++upto;
     }
 
     return slices;
   }
 
+  /**
+   * Creates leaf slices that leverage intra-segment concurrency by splitting segments into multiple
+   * partitions according to the provided max number of documents per slice and maximum number of
+   * segments per slice. If a segment holds more documents than the provided max per slice, it gets
+   * split into equal size partitions that each gets its own slice assigned.

Review Comment:
   Can a slice contain multiple partitions? (This impl doesn't do so right now, but it is allowed/possible?) Can a slice mix partitions and whole leaves? (I guess one could always take a whole leaf and switch it to a single partition `0..maxDoc`).



##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -363,7 +374,88 @@ public static LeafSlice[] slices(
+  /**
+   * Creates leaf slices that leverage intra-segment concurrency by splitting segments into multiple
+   * partitions according to the provided max number of documents per slice and maximum number of
+   * segments per slice. If a segment holds more documents than the provided max per slice, it gets
+   * split into equal size partitions that each gets its own slice assigned.
+   *
+   * <p>Note: this method is not yet called by the default slicing implementation {@link
+   * #slices(List)}. Certain queries that require segment-level computation ahead of time duplicate
+   * this effort across segment partitions. Once that can be shared across partitions we can safely
+   * create partitions by default and perhaps refine the slicing approach implemented in this
+   * method. For this reason segment partitions are currently only created in tests. Users can call
+   * this method at their own risk.
+   *
+   * @lucene.experimental
+   */
+  public static LeafSlice[] slicesWithPartitions(

Review Comment:
   I think we should expect eventually to remove `slices` and replace it with this better implementation (once we resolve the performance penalty)? Maybe we rename the method to `slices` and take a `boolean allowPartitions` which defaults to `false` today until we fix the performance penalty?
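As an illustration of the mixing question raised in the first of the two comments above, a hedged sketch built only from the `LeafSlice` constructor and the `LeafReaderContextPartition` factory methods that appear in this diff. Whether such mixed slices are actually intended is exactly what the comment asks, so this shows what the API appears to accept, not a recommendation:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher.LeafReaderContextPartition;
import org.apache.lucene.search.IndexSearcher.LeafSlice;

class MixedSlicesSketch {
  // Hypothetical slicing: "small" is searched as a whole-segment partition while
  // "large" is split in two, and the two partitions of "large" are assigned to
  // different slices so that no slice holds more than one partition of a segment.
  static LeafSlice[] mixedSlices(LeafReaderContext small, LeafReaderContext large) {
    int mid = large.reader().maxDoc() / 2;
    LeafSlice first =
        new LeafSlice(
            new ArrayList<>(
                List.of(
                    LeafReaderContextPartition.createForEntireSegment(small),
                    // doc ids [0, mid): min inclusive, max exclusive
                    LeafReaderContextPartition.createFromAndTo(large, 0, mid))));
    LeafSlice second =
        new LeafSlice(
            new ArrayList<>(
                List.of(
                    // doc ids [mid, maxDoc)
                    LeafReaderContextPartition.createFromAndTo(
                        large, mid, large.reader().maxDoc()))));
    return new LeafSlice[] {first, second};
  }
}
```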
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -363,7 +374,88 @@ public static LeafSlice[] slices(
+   * <p>Note: this method is not yet called by the default slicing implementation {@link
+   * #slices(List)}. Certain queries that require segment-level computation ahead of time duplicate
+   * this effort across segment partitions. Once that can be shared across partitions we can safely
+   * create partitions by default and perhaps refine the slicing approach implemented in this

Review Comment:
   Can we open an issue ("stop duplicating per-segment work when a segment is split into multiple partitions for intra-segment concurrency" or so) and reference it in this javadoc?



##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -890,11 +1017,100 @@ public static class LeafSlice {
      *
      * @lucene.experimental
      */
-    public final LeafReaderContext[] leaves;
+    public final LeafReaderContextPartition[] partitions;
+
+    private final int maxDocs;
+
+    public LeafSlice(List<LeafReaderContextPartition> leafReaderContextPartitions) {
+      Comparator<LeafReaderContextPartition> docBaseComparator =
+          Comparator.comparingInt(l -> l.ctx.docBase);
+      Comparator<LeafReaderContextPartition> minDocIdComparator =
+          Comparator.comparingInt(l -> l.minDocId);
+      leafReaderContextPartitions.sort(docBaseComparator.thenComparing(minDocIdComparator));
+      this.partitions = leafReaderContextPartitions.toArray(new LeafReaderContextPartition[0]);
+      this.maxDocs =
+          Arrays.stream(partitions)
+              .map(LeafReaderContextPartition::getMaxDocs)
+              .reduce(Integer::sum)
+              .get();
+    }
 
-    public LeafSlice(List<LeafReaderContext> leavesList) {
-      Collections.sort(leavesList, Comparator.comparingInt(l -> l.docBase));
-      this.leaves = leavesList.toArray(new LeafReaderContext[0]);
+    /**
+     * Returns the total number of docs that a slice targets, by summing the number of docs that
+     * each of its leaf context partitions targets.
+     */
+    public int getMaxDocs() {
+      return maxDocs;
+    }
+  }
+
+  /**
+   * Holds information about a specific leaf context and the corresponding range of doc ids to
+   * search within. Used to optionally search across partitions of the same segment concurrently.
+   *
+   * <p>A partition instance can be created via {@link #createForEntireSegment(LeafReaderContext)},
+   * in which case it will target the entire provided {@link LeafReaderContext}. A true partition of
+   * a segment can be created via {@link #createFromAndTo(LeafReaderContext, int, int)} providing
+   * the minimum doc id (including) to search as well as the max doc id (not including).

Review Comment:
   `not including` -> `excluding` (matching the range expression pseudo-standard lol)?
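To spell out the inclusive/exclusive convention discussed in the last comment, a small sketch (again assuming the `createFromAndTo` factory shown in this diff) that splits one segment into `n` contiguous partitions covering `[0, maxDoc)`: because each max doc id is excluded, it can double as the next partition's min doc id with no gap or overlap:

```java
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher.LeafReaderContextPartition;

class EqualPartitionsSketch {
  // Hypothetical helper: splits ctx into n roughly equal, contiguous partitions.
  static LeafReaderContextPartition[] split(LeafReaderContext ctx, int n) {
    int maxDoc = ctx.reader().maxDoc();
    LeafReaderContextPartition[] partitions = new LeafReaderContextPartition[n];
    for (int i = 0; i < n; i++) {
      int min = (int) ((long) i * maxDoc / n); // inclusive lower bound
      int max = (int) ((long) (i + 1) * maxDoc / n); // exclusive upper bound
      partitions[i] = LeafReaderContextPartition.createFromAndTo(ctx, min, max);
    }
    return partitions;
  }
}
```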
##########
lucene/MIGRATE.md:
##########
@@ -802,5 +802,35 @@ both `TopDocs` as well as facets results included in a reduced `FacetsCollector`
+### Signature of IndexSearcher#searchLeaf changed
+
+With the introduction of intra-segment query concurrency support, the `IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector)`
+method accepts now two additional int arguments to identify the min and max doc id of the range that the leaf partition

Review Comment:
   And maybe `to identify the min/max range of docids that will be searched in this leaf partition`?



##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -363,7 +374,88 @@ public static LeafSlice[] slices(
+  /**
+   * Creates leaf slices that leverage intra-segment concurrency by splitting segments into multiple
+   * partitions according to the provided max number of documents per slice and maximum number of
+   * segments per slice. If a segment holds more documents than the provided max per slice, it gets

Review Comment:
   `gets` -> `is`?



##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -363,7 +374,88 @@ public static LeafSlice[] slices(
+   * <p>Note: this method is not yet called by the default slicing implementation {@link
+   * #slices(List)}. Certain queries that require segment-level computation ahead of time duplicate

Review Comment:
   Maybe strengthen this a bit: `There is still a performance penalty for queries that require segment-level computation ahead of time, such as points/range and KNN queries. This is an implementation limitation that we expect to improve in future releases` or so?



##########
lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollectorManager.java:
##########
@@ -28,17 +36,114 @@
  */
 public class TotalHitCountCollectorManager
     implements CollectorManager<TotalHitCountCollector, Integer> {
+
+  private final boolean hasSegmentPartitions;
+
+  /**
+   * Creates a new total hit count collector manager. The collectors returned by {@link
+   * #newCollector()} don't support intra-segment concurrency. Use the other constructor if segments
+   * partitions are being searched.
+   */
+  public TotalHitCountCollectorManager() {
+    this(false);
+  }
+
+  /**
+   * Creates a new total hit count collector manager, providing a flag that signals whether segment
+   * partitions are being searched, in which case the different collector need to share state to
+   * ensure consistent behaviour across partitions of the same segment. There are segment partitions
+   * when the {@link IndexSearcher#slices(List)} methods returns leaf slices that target leaf reader
+   * partitions.
+   *
+   * @see IndexSearcher#slices(List)
+   * @see org.apache.lucene.search.IndexSearcher.LeafReaderContextPartition
+   * @param hasSegmentPartitions whether the collector manager needs to create collectors that
+   *     support searching segment partitions in parallel (via intra-segment search concurrency)
+   */
+  public TotalHitCountCollectorManager(boolean hasSegmentPartitions) {

Review Comment:
   Hmm ideally the user wouldn't have to pass this boolean, and this manager would instead detect itself if it sees multiple partitions? Will other collector managers also need the user to pass this boolean? (I know hit counting is a tricky case thanks to our sub-linear optimizations, yay).
   
   Is this maybe done as a (user visible) optimization to avoid using the "partition aware" counting? It'd be better if that opto lived under the hood, somehow ...
   
   Do we at least gracefully detect if the user makes a mistake and passes the wrong boolean here vs how the index was sliced/partitioned?
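For context on the coupling being questioned here, a hedged usage sketch based on the constructor javadoc quoted above: the caller has to pass a flag that mirrors how the searcher's slices were built, which is precisely the ergonomics concern in the comment. The wrapper class, helper, and parameter names are hypothetical:

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TotalHitCountCollectorManager;

class HitCountSketch {
  // Hypothetical helper: the caller must already know whether the searcher's
  // slices target segment partitions (intra-segment concurrency) and pass a
  // matching flag, per the constructor javadoc in this diff.
  static int countAllDocs(IndexSearcher searcher, boolean searcherUsesSegmentPartitions)
      throws IOException {
    TotalHitCountCollectorManager manager =
        new TotalHitCountCollectorManager(searcherUsesSegmentPartitions);
    return searcher.search(new MatchAllDocsQuery(), manager);
  }
}
```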