mikemccand commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1750413448
##########
lucene/MIGRATE.md:
##########
@@ -802,5 +802,35 @@ both `TopDocs` as well as facets results included in a reduced `FacetsCollector`
### `SearchWithCollectorTask` no longer supports the `collector.class` config parameter
-`collector.class` used to allow users to load a custom collector implementation. `collector.manager.class`
-replaces it by allowing users to load a custom collector manager instead. (Luca Cavanna)
\ No newline at end of file
+`collector.class` used to allow users to load a custom collector implementation. `collector.manager.class`
+replaces it by allowing users to load a custom collector manager instead. (Luca Cavanna)
+
+### BulkScorer#score(LeafCollector collector, Bits acceptDocs) removed
+
+Use `BulkScorer#score(LeafCollector collector, Bits acceptDocs, int min, int max)` instead. In order to score the
+entire leaf, provide `0` as min and `DocIdSetIterator.NO_MORE_DOCS` as max. `BulkScorer` subclasses that override
+such method need to instead override the method variant that additionally takes the range of doc ids as arguments.
+
+### CollectorManager#newCollector and Collector#getLeafCollector contract
+
+With the introduction of intra-segment query concurrency support, a `LeafCollector` for a given `LeafReaderContext`
+instance may be requested multiple times as part of a single search call. This may only happen across separate
+`Collector` instances returned by `CollectorManager#newCollector`, and never for the same `Collector` instance, given
+that a slice can only ever hold a single partition of a given segment. Any logic or computation that needs to happen
+once per segment requires specific handling in the collector manager implementation.
+See `TotalHitCountCollectorManager` as an example.
+
+### Weight#scorer, Weight#bulkScorer and Weight#scorerSupplier contract
+
+With the introduction of intra-segment query concurrency support, a `Scorer`, `ScorerSupplier` or `BulkScorer` may
+be requested multiple times for the same `LeafReaderContext` instance as part of a single search call. That
+may happen concurrently from separate threads each searching a specific doc id range of the segment.
+`Weight` implementations that rely on the assumption that a scorer, bulk scorer or scorer supplier for a given
+`LeafReaderContext` is requested once per search need updating.
+
+### Signature of IndexSearcher#searchLeaf changed
+
+With the introduction of intra-segment query concurrency support, the `IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector)`
+method accepts now two additional int arguments to identify the min and max doc id of the range that the leaf partition
Review Comment:
`accepts now` -> `now accepts`?
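
   Aside: the `BulkScorer#score` entry quoted above boils down to a one-line call-site change. A minimal sketch of the migration (hypothetical helper class, names assumed, not part of the PR):

```java
import java.io.IOException;
import org.apache.lucene.search.BulkScorer;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.util.Bits;

class ScoreEntireLeaf {
  // The two-argument score(collector, acceptDocs) overload is gone; callers that
  // want to score the whole leaf now pass the full doc id range explicitly.
  static void scoreAll(BulkScorer bulkScorer, LeafCollector collector, Bits acceptDocs)
      throws IOException {
    // before: bulkScorer.score(collector, acceptDocs);
    bulkScorer.score(collector, acceptDocs, 0, DocIdSetIterator.NO_MORE_DOCS);
  }
}
```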
##########
lucene/core/src/java/org/apache/lucene/search/CollectorManager.java:
##########
@@ -31,6 +32,13 @@
* fully collected.
* </ul>
*
+ * <p><strong>Note:</strong> Separate {@link Collector} instances returned by
{@link
Review Comment:
Maybe `Multiple LeafCollectors may be requested for the same
LeafReaderContext via getLeafCollector` (with the nice `link`s included)?
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -363,7 +374,88 @@ public static LeafSlice[] slices(
LeafSlice[] slices = new LeafSlice[groupedLeaves.size()];
int upto = 0;
for (List<LeafReaderContext> currentLeaf : groupedLeaves) {
- slices[upto] = new LeafSlice(currentLeaf);
+ slices[upto] =
+ new LeafSlice(
+ new ArrayList<>(
+ currentLeaf.stream()
+ .map(LeafReaderContextPartition::createForEntireSegment)
+ .toList()));
+ ++upto;
+ }
+
+ return slices;
+ }
+
+ /**
+ * Creates leaf slices that leverage intra-segment concurrency by splitting segments into multiple
+ * partitions according to the provided max number of documents per slice and maximum number of
+ * segments per slice. If a segment holds more documents than the provided max per slice, it gets
+ * split into equal size partitions that each gets its own slice assigned.
Review Comment:
Can a slice contain multiple partitions? (This impl doesn't do so right
now, but it is allowed/possible?) Can a slice mix partitions and whole leaves?
(I guess one could always take a whole leaf and switch it to a single
partition `0..maxDoc`).
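
   For illustration, a rough sketch of what mixing could look like with the factory methods from this hunk (hypothetical helper; whether such mixing is actually supported is exactly the question above):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher.LeafReaderContextPartition;
import org.apache.lucene.search.IndexSearcher.LeafSlice;

class MixedSliceSketch {
  // One slice holding a whole (small) segment next to one partition of a larger
  // segment. A whole leaf is just the special case of a partition spanning 0..maxDoc.
  static LeafSlice mixedSlice(LeafReaderContext small, LeafReaderContext big) {
    List<LeafReaderContextPartition> partitions = new ArrayList<>();
    partitions.add(LeafReaderContextPartition.createForEntireSegment(small));
    // first half of the bigger segment; its second half would go to another slice
    partitions.add(LeafReaderContextPartition.createFromAndTo(big, 0, big.reader().maxDoc() / 2));
    return new LeafSlice(partitions);
  }
}
```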
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -363,7 +374,88 @@ public static LeafSlice[] slices(
LeafSlice[] slices = new LeafSlice[groupedLeaves.size()];
int upto = 0;
for (List<LeafReaderContext> currentLeaf : groupedLeaves) {
- slices[upto] = new LeafSlice(currentLeaf);
+ slices[upto] =
+ new LeafSlice(
+ new ArrayList<>(
+ currentLeaf.stream()
+ .map(LeafReaderContextPartition::createForEntireSegment)
+ .toList()));
+ ++upto;
+ }
+
+ return slices;
+ }
+
+ /**
+ * Creates leaf slices that leverage intra-segment concurrency by splitting segments into multiple
+ * partitions according to the provided max number of documents per slice and maximum number of
+ * segments per slice. If a segment holds more documents than the provided max per slice, it gets
+ * split into equal size partitions that each gets its own slice assigned.
+ *
+ * <p>Note: this method is not yet called by the default slicing implementation {@link
+ * #slices(List)}. Certain queries that require segment-level computation ahead of time duplicate
+ * this effort across segment partitions. Once that can be shared across partitions we can safely
+ * create partitions by default and perhaps refine the slicing approach implemented in this
+ * method. For this reason segment partitions are currently only created in tests. Users can call
+ * this method at their own risk.
+ *
+ * @lucene.experimental
+ */
+ public static LeafSlice[] slicesWithPartitions(
Review Comment:
I think we should expect eventually to remove `slices` and replace it with
this better implementation (once we resolve the performance penalty)? Maybe we
rename the method to `slices` and take a `boolean allowPartitions` which
defaults to `false` today until we fix the performance penalty?
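
   Rough sketch of that suggested shape (purely hypothetical, parameter names and types are assumed rather than taken from the PR):

```java
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.IndexSearcher.LeafSlice;

final class SlicingSketch {
  // Single entry point where partitioning is opt-in until the duplicated
  // per-segment work is resolved; today the flag would default to false.
  static LeafSlice[] slices(
      List<LeafReaderContext> leaves,
      int maxDocsPerSlice,
      int maxSegmentsPerSlice,
      boolean allowPartitions) {
    if (allowPartitions == false) {
      // current behaviour: slices hold whole segments only
      return IndexSearcher.slices(leaves, maxDocsPerSlice, maxSegmentsPerSlice);
    }
    // segments larger than maxDocsPerSlice get split into partitions
    // (assumed parameters for the new method added in this PR)
    return IndexSearcher.slicesWithPartitions(leaves, maxDocsPerSlice, maxSegmentsPerSlice);
  }
}
```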
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -363,7 +374,88 @@ public static LeafSlice[] slices(
LeafSlice[] slices = new LeafSlice[groupedLeaves.size()];
int upto = 0;
for (List<LeafReaderContext> currentLeaf : groupedLeaves) {
- slices[upto] = new LeafSlice(currentLeaf);
+ slices[upto] =
+ new LeafSlice(
+ new ArrayList<>(
+ currentLeaf.stream()
+ .map(LeafReaderContextPartition::createForEntireSegment)
+ .toList()));
+ ++upto;
+ }
+
+ return slices;
+ }
+
+ /**
+ * Creates leaf slices that leverage intra-segment concurrency by splitting segments into multiple
+ * partitions according to the provided max number of documents per slice and maximum number of
+ * segments per slice. If a segment holds more documents than the provided max per slice, it gets
+ * split into equal size partitions that each gets its own slice assigned.
+ *
+ * <p>Note: this method is not yet called by the default slicing implementation {@link
+ * #slices(List)}. Certain queries that require segment-level computation ahead of time duplicate
+ * this effort across segment partitions. Once that can be shared across partitions we can safely
+ * create partitions by default and perhaps refine the slicing approach implemented in this
Review Comment:
Can we open an issue ("stop duplicating per-segment work when a segment is
split into multiple partitions for intra-segment concurrency" or so) and
reference it in this javadoc?
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -890,11 +1017,100 @@ public static class LeafSlice {
*
* @lucene.experimental
*/
- public final LeafReaderContext[] leaves;
+ public final LeafReaderContextPartition[] partitions;
+
+ private final int maxDocs;
+
+ public LeafSlice(List<LeafReaderContextPartition> leafReaderContextPartitions) {
+ Comparator<LeafReaderContextPartition> docBaseComparator =
+ Comparator.comparingInt(l -> l.ctx.docBase);
+ Comparator<LeafReaderContextPartition> minDocIdComparator =
+ Comparator.comparingInt(l -> l.minDocId);
+ leafReaderContextPartitions.sort(docBaseComparator.thenComparing(minDocIdComparator));
+ this.partitions = leafReaderContextPartitions.toArray(new LeafReaderContextPartition[0]);
+ this.maxDocs =
+ Arrays.stream(partitions)
+ .map(LeafReaderContextPartition::getMaxDocs)
+ .reduce(Integer::sum)
+ .get();
+ }
- public LeafSlice(List<LeafReaderContext> leavesList) {
- Collections.sort(leavesList, Comparator.comparingInt(l -> l.docBase));
- this.leaves = leavesList.toArray(new LeafReaderContext[0]);
+ /**
+ * Returns the total number of docs that a slice targets, by summing the number of docs that
+ * each of its leaf context partitions targets.
+ */
+ public int getMaxDocs() {
+ return maxDocs;
+ }
+ }
+
+ /**
+ * Holds information about a specific leaf context and the corresponding range of doc ids to
+ * search within. Used to optionally search across partitions of the same segment concurrently.
+ *
+ * <p>A partition instance can be created via {@link #createForEntireSegment(LeafReaderContext)},
+ * in which case it will target the entire provided {@link LeafReaderContext}. A true partition of
+ * a segment can be created via {@link #createFromAndTo(LeafReaderContext, int, int)} providing
+ * the minimum doc id (including) to search as well as the max doc id (not including).
Review Comment:
`not including` -> `excluding` (matching the range expression
pseudo-standard lol)?
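
   To make the half-open range concrete, a small illustrative sketch (hypothetical helper; it assumes `DocIdSetIterator.NO_MORE_DOCS` is an acceptable upper bound for the last partition, mirroring the `BulkScorer#score` migration note):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher.LeafReaderContextPartition;
import org.apache.lucene.search.IndexSearcher.LeafSlice;

final class PartitionRangeSketch {
  // Split one segment into two partitions along a half-open doc id range:
  // min is inclusive, max is exclusive.
  static List<LeafSlice> twoPartitions(LeafReaderContext ctx) {
    int mid = ctx.reader().maxDoc() / 2;
    LeafReaderContextPartition first =
        LeafReaderContextPartition.createFromAndTo(ctx, 0, mid); // docs [0, mid)
    LeafReaderContextPartition second =
        LeafReaderContextPartition.createFromAndTo(ctx, mid, DocIdSetIterator.NO_MORE_DOCS); // docs [mid, maxDoc)
    // Each partition goes to its own slice: per the contract above, a slice may
    // only ever hold a single partition of a given segment.
    return List.of(
        new LeafSlice(new ArrayList<>(List.of(first))),
        new LeafSlice(new ArrayList<>(List.of(second))));
  }
}
```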
##########
lucene/MIGRATE.md:
##########
@@ -802,5 +802,35 @@ both `TopDocs` as well as facets results included in a reduced `FacetsCollector`
### `SearchWithCollectorTask` no longer supports the `collector.class` config parameter
-`collector.class` used to allow users to load a custom collector implementation. `collector.manager.class`
-replaces it by allowing users to load a custom collector manager instead. (Luca Cavanna)
\ No newline at end of file
+`collector.class` used to allow users to load a custom collector implementation. `collector.manager.class`
+replaces it by allowing users to load a custom collector manager instead. (Luca Cavanna)
+
+### BulkScorer#score(LeafCollector collector, Bits acceptDocs) removed
+
+Use `BulkScorer#score(LeafCollector collector, Bits acceptDocs, int min, int max)` instead. In order to score the
+entire leaf, provide `0` as min and `DocIdSetIterator.NO_MORE_DOCS` as max. `BulkScorer` subclasses that override
+such method need to instead override the method variant that additionally takes the range of doc ids as arguments.
+
+### CollectorManager#newCollector and Collector#getLeafCollector contract
+
+With the introduction of intra-segment query concurrency support, a `LeafCollector` for a given `LeafReaderContext`
+instance may be requested multiple times as part of a single search call. This may only happen across separate
+`Collector` instances returned by `CollectorManager#newCollector`, and never for the same `Collector` instance, given
+that a slice can only ever hold a single partition of a given segment. Any logic or computation that needs to happen
+once per segment requires specific handling in the collector manager implementation.
+See `TotalHitCountCollectorManager` as an example.
+
+### Weight#scorer, Weight#bulkScorer and Weight#scorerSupplier contract
+
+With the introduction of intra-segment query concurrency support, a `Scorer`, `ScorerSupplier` or `BulkScorer` may
+be requested multiple times for the same `LeafReaderContext` instance as part of a single search call. That
+may happen concurrently from separate threads each searching a specific doc id range of the segment.
+`Weight` implementations that rely on the assumption that a scorer, bulk scorer or scorer supplier for a given
+`LeafReaderContext` is requested once per search need updating.
+
+### Signature of IndexSearcher#searchLeaf changed
+
+With the introduction of intra-segment query concurrency support, the `IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector)`
+method accepts now two additional int arguments to identify the min and max doc id of the range that the leaf partition
Review Comment:
And maybe `to identify the min/max range of docids that will be searched in
this leaf partition`?
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -363,7 +374,88 @@ public static LeafSlice[] slices(
LeafSlice[] slices = new LeafSlice[groupedLeaves.size()];
int upto = 0;
for (List<LeafReaderContext> currentLeaf : groupedLeaves) {
- slices[upto] = new LeafSlice(currentLeaf);
+ slices[upto] =
+ new LeafSlice(
+ new ArrayList<>(
+ currentLeaf.stream()
+ .map(LeafReaderContextPartition::createForEntireSegment)
+ .toList()));
+ ++upto;
+ }
+
+ return slices;
+ }
+
+ /**
+ * Creates leaf slices that leverage intra-segment concurrency by splitting segments into multiple
+ * partitions according to the provided max number of documents per slice and maximum number of
+ * segments per slice. If a segment holds more documents than the provided max per slice, it gets
Review Comment:
`gets` -> `is`?
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -363,7 +374,88 @@ public static LeafSlice[] slices(
LeafSlice[] slices = new LeafSlice[groupedLeaves.size()];
int upto = 0;
for (List<LeafReaderContext> currentLeaf : groupedLeaves) {
- slices[upto] = new LeafSlice(currentLeaf);
+ slices[upto] =
+ new LeafSlice(
+ new ArrayList<>(
+ currentLeaf.stream()
+ .map(LeafReaderContextPartition::createForEntireSegment)
+ .toList()));
+ ++upto;
+ }
+
+ return slices;
+ }
+
+ /**
+ * Creates leaf slices that leverage intra-segment concurrency by splitting segments into multiple
+ * partitions according to the provided max number of documents per slice and maximum number of
+ * segments per slice. If a segment holds more documents than the provided max per slice, it gets
+ * split into equal size partitions that each gets its own slice assigned.
+ *
+ * <p>Note: this method is not yet called by the default slicing implementation {@link
+ * #slices(List)}. Certain queries that require segment-level computation ahead of time duplicate
Review Comment:
Maybe strengthen this a bit: `There is still a performance penalty for
queries that require segment-level computation ahead of time, such as
points/range and KNN queries. This is an implementation limitation that we
expect to improve in future releases` or so?
##########
lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollectorManager.java:
##########
@@ -28,17 +36,114 @@
*/
public class TotalHitCountCollectorManager
implements CollectorManager<TotalHitCountCollector, Integer> {
+
+ private final boolean hasSegmentPartitions;
+
+ /**
+ * Creates a new total hit count collector manager. The collectors returned by {@link
+ * #newCollector()} don't support intra-segment concurrency. Use the other constructor if segment
+ * partitions are being searched.
+ */
+ public TotalHitCountCollectorManager() {
+ this(false);
+ }
+
+ /**
+ * Creates a new total hit count collector manager, providing a flag that signals whether segment
+ * partitions are being searched, in which case the different collectors need to share state to
+ * ensure consistent behaviour across partitions of the same segment. There are segment partitions
+ * when the {@link IndexSearcher#slices(List)} method returns leaf slices that target leaf reader
+ * partitions.
+ *
+ * @see IndexSearcher#slices(List)
+ * @see org.apache.lucene.search.IndexSearcher.LeafReaderContextPartition
+ * @param hasSegmentPartitions whether the collector manager needs to create collectors that
+ * support searching segment partitions in parallel (via intra-segment search concurrency)
+ */
+ public TotalHitCountCollectorManager(boolean hasSegmentPartitions) {
Review Comment:
Hmm ideally the user wouldn't have to pass this boolean, and this manager
would instead detect itself if it sees multiple partitions? Will other
collector managers also need the user to pass this boolean? (I know hit
counting is a tricky case thanks to our sub-linear optimizations, yay).
Is this maybe done as a (user visible) optimization to avoid using the
"partition aware" counting? It'd be better if that opto lived under the hood,
somehow ...
Do we at least gracefully detect if the user makes a mistake and passes the
wrong boolean here vs how the index was sliced/partitioned?
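
   A sketch of what such self-detection could look like (hypothetical, not in the PR; it assumes the partition exposes its `LeafReaderContext`, the `ctx` field used in the `LeafSlice` constructor above):

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher.LeafReaderContextPartition;
import org.apache.lucene.search.IndexSearcher.LeafSlice;

final class PartitionDetectionSketch {
  // True if any LeafReaderContext shows up in more than one partition, i.e. a
  // segment was split and intra-segment concurrency is in play.
  static boolean hasSegmentPartitions(LeafSlice[] slices) {
    Set<LeafReaderContext> seen = new HashSet<>();
    for (LeafSlice slice : slices) {
      for (LeafReaderContextPartition partition : slice.partitions) {
        if (seen.add(partition.ctx) == false) {
          return true;
        }
      }
    }
    return false;
  }
}
```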
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]