mikemccand commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1750413448
##########
lucene/MIGRATE.md:
##########
@@ -802,5 +802,35 @@ both `TopDocs` as well as facets results included in a reduced `FacetsCollector`
 
 ### `SearchWithCollectorTask` no longer supports the `collector.class` config parameter
 
-`collector.class` used to allow users to load a custom collector implementation. `collector.manager.class`
-replaces it by allowing users to load a custom collector manager instead. (Luca Cavanna)
\ No newline at end of file
+`collector.class` used to allow users to load a custom collector implementation. `collector.manager.class`
+replaces it by allowing users to load a custom collector manager instead. (Luca Cavanna)
+
+### BulkScorer#score(LeafCollector collector, Bits acceptDocs) removed
+
+Use `BulkScorer#score(LeafCollector collector, Bits acceptDocs, int min, int max)` instead. In order to score the
+entire leaf, provide `0` as min and `DocIdSetIterator.NO_MORE_DOCS` as max. `BulkScorer` subclasses that override
+such method need to instead override the method variant that takes the range of doc ids as well as arguments.
+
+### CollectorManager#newCollector and Collector#getLeafCollector contract
+
+With the introduction of intra-segment query concurrency support, a `LeafCollector` for a given `LeafReaderContext`
+instance may be requested multiple times as part of a single search call. This may only happen across separate
+`Collector` instances returned by `CollectorManager#newCollector`, and never for the same `Collector` instance, given
+that a slice can only ever hold a single partition of a given segment. Any logic or computation that needs to happen
+once per segment requires specific handling in the collector manager implementation.
+See `TotalHitCountCollectorManager` as an example.
+
+### Weight#scorer, Weight#bulkScorer and Weight#scorerSupplier contract
+
+With the introduction of intra-segment query concurrency support, a `Scorer`, `ScorerSupplier` or `BulkScorer` may
+be requested multiple times for the same `LeafReaderContext` instance as part of a single search call. That
+may happen concurrently from separate threads each searching a specific doc id range of the segment.
+`Weight` implementations that rely on the assumption that a scorer, bulk scorer or scorer supplier for a given
+`LeafReaderContext` is requested once per search need updating.
+
+### Signature of IndexSearcher#searchLeaf changed
+
+With the introduction of intra-segment query concurrency support, the `IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector)`
+method accepts now two additional int arguments to identify the min and max doc id of the range that the leaf partition

Review Comment:
   `accepts now` -> `now accepts`?



##########
lucene/core/src/java/org/apache/lucene/search/CollectorManager.java:
##########
@@ -31,6 +32,13 @@
  * fully collected.
  * </ul>
  *
+ * <p><strong>Note:</strong> Separate {@link Collector} instances returned by {@link

Review Comment:
   Maybe `Multiple LeafCollectors may be requested for the same LeafReaderContext via getLeafCollector` (with the nice `link`s included)?
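To make the `BulkScorer#score` migration entry in the MIGRATE.md hunk above concrete, here is a minimal sketch of scoring an entire leaf with the remaining ranged variant; the wrapper class and helper method are hypothetical, only the `score(collector, acceptDocs, 0, DocIdSetIterator.NO_MORE_DOCS)` call comes from the documented replacement:

```java
import java.io.IOException;
import org.apache.lucene.search.BulkScorer;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.util.Bits;

class WholeLeafScoringSketch {
  // Hypothetical helper: scores the whole doc id space of a leaf, replacing the
  // removed two-argument BulkScorer#score(LeafCollector, Bits) call.
  static void scoreEntireLeaf(BulkScorer bulkScorer, LeafCollector collector, Bits acceptDocs)
      throws IOException {
    // Before: bulkScorer.score(collector, acceptDocs);
    // After: 0 as the min doc id, DocIdSetIterator.NO_MORE_DOCS as the max doc id.
    bulkScorer.score(collector, acceptDocs, 0, DocIdSetIterator.NO_MORE_DOCS);
  }
}
```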
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -363,7 +374,88 @@ public static LeafSlice[] slices(
     LeafSlice[] slices = new LeafSlice[groupedLeaves.size()];
     int upto = 0;
     for (List<LeafReaderContext> currentLeaf : groupedLeaves) {
-      slices[upto] = new LeafSlice(currentLeaf);
+      slices[upto] =
+          new LeafSlice(
+              new ArrayList<>(
+                  currentLeaf.stream()
+                      .map(LeafReaderContextPartition::createForEntireSegment)
+                      .toList()));
       ++upto;
     }
 
     return slices;
   }
 
+  /**
+   * Creates leaf slices that leverage intra-segment concurrency by splitting segments into multiple
+   * partitions according to the provided max number of documents per slice and maximum number of
+   * segments per slice. If a segment holds more documents than the provided max per slice, it gets
+   * split into equal size partitions that each gets its own slice assigned.

Review Comment:
   Can a slice contain multiple partitions? (This impl doesn't do so right now, but it is allowed/possible?) Can a slice mix partitions and whole leaves? (I guess one could always take a whole leaf and switch it to a single partition `0..maxDoc`).



##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -363,7 +374,88 @@ public static LeafSlice[] slices(
+  /**
+   * Creates leaf slices that leverage intra-segment concurrency by splitting segments into multiple
+   * partitions according to the provided max number of documents per slice and maximum number of
+   * segments per slice. If a segment holds more documents than the provided max per slice, it gets
+   * split into equal size partitions that each gets its own slice assigned.
+   *
+   * <p>Note: this method is not yet called by the default slicing implementation {@link
+   * #slices(List)}. Certain queries that require segment-level computation ahead of time duplicate
+   * this effort across segment partitions. Once that can be shared across partitions we can safely
+   * create partitions by default and perhaps refine the slicing approach implemented in this
+   * method. For this reason segment partitions are currently only created in tests. Users can call
+   * this method at their own risk.
+   *
+   * @lucene.experimental
+   */
+  public static LeafSlice[] slicesWithPartitions(

Review Comment:
   I think we should expect eventually to remove `slices` and replace it with this better implementation (once we resolve the performance penalty)? Maybe we rename the method to `slices` and take a `boolean allowPartitions` which defaults to `false` today until we fix the performance penalty?
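As an illustration of the mixing question raised in the first of the two comments above, a hedged sketch built only from the `LeafSlice` constructor and the `LeafReaderContextPartition` factory methods that appear in this diff. Whether such mixed slices are actually intended is exactly what the comment asks, so this shows what the API appears to accept, not a recommendation:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher.LeafReaderContextPartition;
import org.apache.lucene.search.IndexSearcher.LeafSlice;

class MixedSlicesSketch {
  // Hypothetical slicing: "small" is searched as a whole-segment partition while
  // "large" is split in two, and the two partitions of "large" are assigned to
  // different slices so that no slice holds more than one partition of a segment.
  static LeafSlice[] mixedSlices(LeafReaderContext small, LeafReaderContext large) {
    int mid = large.reader().maxDoc() / 2;
    LeafSlice first =
        new LeafSlice(
            new ArrayList<>(
                List.of(
                    LeafReaderContextPartition.createForEntireSegment(small),
                    // doc ids [0, mid): min inclusive, max exclusive
                    LeafReaderContextPartition.createFromAndTo(large, 0, mid))));
    LeafSlice second =
        new LeafSlice(
            new ArrayList<>(
                List.of(
                    // doc ids [mid, maxDoc)
                    LeafReaderContextPartition.createFromAndTo(
                        large, mid, large.reader().maxDoc()))));
    return new LeafSlice[] {first, second};
  }
}
```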
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -363,7 +374,88 @@ public static LeafSlice[] slices(
+   * <p>Note: this method is not yet called by the default slicing implementation {@link
+   * #slices(List)}. Certain queries that require segment-level computation ahead of time duplicate
+   * this effort across segment partitions. Once that can be shared across partitions we can safely
+   * create partitions by default and perhaps refine the slicing approach implemented in this

Review Comment:
   Can we open an issue ("stop duplicating per-segment work when a segment is split into multiple partitions for intra-segment concurrency" or so) and reference it in this javadoc?



##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -890,11 +1017,100 @@ public static class LeafSlice {
      *
      * @lucene.experimental
      */
-    public final LeafReaderContext[] leaves;
+    public final LeafReaderContextPartition[] partitions;
+
+    private final int maxDocs;
+
+    public LeafSlice(List<LeafReaderContextPartition> leafReaderContextPartitions) {
+      Comparator<LeafReaderContextPartition> docBaseComparator =
+          Comparator.comparingInt(l -> l.ctx.docBase);
+      Comparator<LeafReaderContextPartition> minDocIdComparator =
+          Comparator.comparingInt(l -> l.minDocId);
+      leafReaderContextPartitions.sort(docBaseComparator.thenComparing(minDocIdComparator));
+      this.partitions = leafReaderContextPartitions.toArray(new LeafReaderContextPartition[0]);
+      this.maxDocs =
+          Arrays.stream(partitions)
+              .map(LeafReaderContextPartition::getMaxDocs)
+              .reduce(Integer::sum)
+              .get();
+    }
 
-    public LeafSlice(List<LeafReaderContext> leavesList) {
-      Collections.sort(leavesList, Comparator.comparingInt(l -> l.docBase));
-      this.leaves = leavesList.toArray(new LeafReaderContext[0]);
+    /**
+     * Returns the total number of docs that a slice targets, by summing the number of docs that
+     * each of its leaf context partitions targets.
+     */
+    public int getMaxDocs() {
+      return maxDocs;
+    }
+  }
+
+  /**
+   * Holds information about a specific leaf context and the corresponding range of doc ids to
+   * search within. Used to optionally search across partitions of the same segment concurrently.
+   *
+   * <p>A partition instance can be created via {@link #createForEntireSegment(LeafReaderContext)},
+   * in which case it will target the entire provided {@link LeafReaderContext}. A true partition of
+   * a segment can be created via {@link #createFromAndTo(LeafReaderContext, int, int)} providing
+   * the minimum doc id (including) to search as well as the max doc id (not including).

Review Comment:
   `not including` -> `excluding` (matching the range expression pseudo-standard lol)?
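To spell out the inclusive/exclusive convention discussed in the last comment, a small sketch (again assuming the `createFromAndTo` factory shown in this diff) that splits one segment into `n` contiguous partitions covering `[0, maxDoc)`: because each max doc id is excluded, it can double as the next partition's min doc id with no gap or overlap:

```java
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher.LeafReaderContextPartition;

class EqualPartitionsSketch {
  // Hypothetical helper: splits ctx into n roughly equal, contiguous partitions.
  static LeafReaderContextPartition[] split(LeafReaderContext ctx, int n) {
    int maxDoc = ctx.reader().maxDoc();
    LeafReaderContextPartition[] partitions = new LeafReaderContextPartition[n];
    for (int i = 0; i < n; i++) {
      int min = (int) ((long) i * maxDoc / n); // inclusive lower bound
      int max = (int) ((long) (i + 1) * maxDoc / n); // exclusive upper bound
      partitions[i] = LeafReaderContextPartition.createFromAndTo(ctx, min, max);
    }
    return partitions;
  }
}
```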
##########
lucene/MIGRATE.md:
##########
@@ -802,5 +802,35 @@ both `TopDocs` as well as facets results included in a reduced `FacetsCollector`
+### Signature of IndexSearcher#searchLeaf changed
+
+With the introduction of intra-segment query concurrency support, the `IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector)`
+method accepts now two additional int arguments to identify the min and max doc id of the range that the leaf partition

Review Comment:
   And maybe `to identify the min/max range of docids that will be searched in this leaf partition`?



##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -363,7 +374,88 @@ public static LeafSlice[] slices(
+  /**
+   * Creates leaf slices that leverage intra-segment concurrency by splitting segments into multiple
+   * partitions according to the provided max number of documents per slice and maximum number of
+   * segments per slice. If a segment holds more documents than the provided max per slice, it gets

Review Comment:
   `gets` -> `is`?



##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -363,7 +374,88 @@ public static LeafSlice[] slices(
+   * <p>Note: this method is not yet called by the default slicing implementation {@link
+   * #slices(List)}. Certain queries that require segment-level computation ahead of time duplicate

Review Comment:
   Maybe strengthen this a bit: `There is still a performance penalty for queries that require segment-level computation ahead of time, such as points/range and KNN queries. This is an implementation limitation that we expect to improve in future releases` or so?



##########
lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollectorManager.java:
##########
@@ -28,17 +36,114 @@
  */
 public class TotalHitCountCollectorManager
     implements CollectorManager<TotalHitCountCollector, Integer> {
+
+  private final boolean hasSegmentPartitions;
+
+  /**
+   * Creates a new total hit count collector manager. The collectors returned by {@link
+   * #newCollector()} don't support intra-segment concurrency. Use the other constructor if segments
+   * partitions are being searched.
+   */
+  public TotalHitCountCollectorManager() {
+    this(false);
+  }
+
+  /**
+   * Creates a new total hit count collector manager, providing a flag that signals whether segment
+   * partitions are being searched, in which case the different collector need to share state to
+   * ensure consistent behaviour across partitions of the same segment. There are segment partitions
+   * when the {@link IndexSearcher#slices(List)} methods returns leaf slices that target leaf reader
+   * partitions.
+   *
+   * @see IndexSearcher#slices(List)
+   * @see org.apache.lucene.search.IndexSearcher.LeafReaderContextPartition
+   * @param hasSegmentPartitions whether the collector manager needs to create collectors that
+   *     support searching segment partitions in parallel (via intra-segment search concurrency)
+   */
+  public TotalHitCountCollectorManager(boolean hasSegmentPartitions) {

Review Comment:
   Hmm ideally the user wouldn't have to pass this boolean, and this manager would instead detect itself if it sees multiple partitions? Will other collector managers also need the user to pass this boolean? (I know hit counting is a tricky case thanks to our sub-linear optimizations, yay).
   
   Is this maybe done as a (user visible) optimization to avoid using the "partition aware" counting? It'd be better if that opto lived under the hood, somehow ...
   
   Do we at least gracefully detect if the user makes a mistake and passes the wrong boolean here vs how the index was sliced/partitioned?
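For context on the coupling being questioned here, a hedged usage sketch based on the constructor javadoc quoted above: the caller has to pass a flag that mirrors how the searcher's slices were built, which is precisely the ergonomics concern in the comment. The wrapper class, helper, and parameter names are hypothetical:

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TotalHitCountCollectorManager;

class HitCountSketch {
  // Hypothetical helper: the caller must already know whether the searcher's
  // slices target segment partitions (intra-segment concurrency) and pass a
  // matching flag, per the constructor javadoc in this diff.
  static int countAllDocs(IndexSearcher searcher, boolean searcherUsesSegmentPartitions)
      throws IOException {
    TotalHitCountCollectorManager manager =
        new TotalHitCountCollectorManager(searcherUsesSegmentPartitions);
    return searcher.search(new MatchAllDocsQuery(), manager);
  }
}
```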