Re: [PR] Add support for intra-segment search concurrency [lucene]

via GitHub Mon, 26 Aug 2024 06:44:19 -0700


javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1731271557



##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -890,11 +945,70 @@ public static class LeafSlice {
      *
      * @lucene.experimental
      */
-    public final LeafReaderContext[] leaves;
+    public final LeafReaderContextPartition[] leaves;
 
-    public LeafSlice(List<LeafReaderContext> leavesList) {
-      Collections.sort(leavesList, Comparator.comparingInt(l -> l.docBase));
-      this.leaves = leavesList.toArray(new LeafReaderContext[0]);
+    public LeafSlice(List<LeafReaderContextPartition> 
leafReaderContextPartitions) {
+      leafReaderContextPartitions.sort(Comparator.comparingInt(l -> 
l.ctx.docBase));
+      // TODO should we sort by minDocId too?
+      this.leaves = leafReaderContextPartitions.toArray(new 
LeafReaderContextPartition[0]);
+    }
+
+    /**
+     * Returns the total number of docs that a slice targets, by summing the 
number of docs that
+     * each of its leaf context partitions targets.
+     */
+    public int getNumDocs() {
+      return Arrays.stream(leaves)
+          .map(LeafReaderContextPartition::getNumDocs)
+          .reduce(Integer::sum)
+          .get();
+    }
+  }
+
+  /**
+   * Holds information about a specific leaf context and the corresponding 
range of doc ids to
+   * search within.
+   *
+   * @lucene.experimental
+   */
+  public static final class LeafReaderContextPartition {
+    private final int minDocId;
+    private final int maxDocId;
+    private final int numDocs;
+    public final LeafReaderContext ctx;
+
+    private LeafReaderContextPartition(
+        LeafReaderContext leafReaderContext, int minDocId, int maxDocId, int 
numDocs) {
+      this.ctx = leafReaderContext;
+      this.minDocId = minDocId;
+      this.maxDocId = maxDocId;
+      this.numDocs = numDocs;

Review Comment:
   good with the proposed validation. You both have a good feeling about 
numDocs hiding something here :) it is very subtle but I found that it is 
better to use `DocIdSetIterator.NO_MORE_DOCS`, hence `Integer.MAX_VALUE` when 
targeting the entire segment, or the last partition, as opposed to using 
maxDoc. I ended up hitting weird issues in that case. I may spend more time 
into trying to address this. It did not feel so important at the time given 
that numDocs is only used in tests to introspect how many docs each partition 
holds. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add support for intra-segment search concurrency [lucene]

Reply via email to