msokolov commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1668890994
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -328,42 +336,65 @@ protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
   /** Static method to segregate LeafReaderContexts amongst multiple slices */
   public static LeafSlice[] slices(
       List<LeafReaderContext> leaves, int maxDocsPerSlice, int maxSegmentsPerSlice) {
+
+    // TODO this is a temporary hack to force testing against multiple leaf reader context slices.
+    // It must be reverted before merging.
+    maxDocsPerSlice = 1;
+    maxSegmentsPerSlice = 1;
+    // end hack
+
     // Make a copy so we can sort:
     List<LeafReaderContext> sortedLeaves = new ArrayList<>(leaves);

     // Sort by maxDoc, descending:
-    Collections.sort(
-        sortedLeaves, Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));
+    sortedLeaves.sort(Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));

-    final List<List<LeafReaderContext>> groupedLeaves = new ArrayList<>();
-    long docSum = 0;
-    List<LeafReaderContext> group = null;
+    final List<List<LeafReaderContextPartition>> groupedLeafPartitions = new ArrayList<>();
+    int currentSliceNumDocs = 0;
+    List<LeafReaderContextPartition> group = null;
     for (LeafReaderContext ctx : sortedLeaves) {
       if (ctx.reader().maxDoc() > maxDocsPerSlice) {
         assert group == null;
-        groupedLeaves.add(Collections.singletonList(ctx));
+        // if the segment does not fit in a single slice, we split it in multiple partitions of

Review Comment:
   I had worked up a version of this where I modified LeafReaderContext/IndexReaderContext to create a new kind of context that models a range within a segment. I added interval start/end to LeafReaderContext, but I suspect a cleaner way would be to make a new thing (IntervalReaderContext or so) and then change APIs to expect IndexReaderContext instead of CompositeReaderContext? Done that way, it might be easier to handle some cases like the single-threaded execution you mentioned.

   But this is more about cleaning up the APIs than about making it work, and we can argue endlessly about what is neater, so I think your approach of deferring such questions makes sense.

##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -328,42 +336,65 @@ protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
   /** Static method to segregate LeafReaderContexts amongst multiple slices */
   public static LeafSlice[] slices(
       List<LeafReaderContext> leaves, int maxDocsPerSlice, int maxSegmentsPerSlice) {
+
+    // TODO this is a temporary hack to force testing against multiple leaf reader context slices.
+    // It must be reverted before merging.
+    maxDocsPerSlice = 1;
+    maxSegmentsPerSlice = 1;
+    // end hack
+
     // Make a copy so we can sort:
     List<LeafReaderContext> sortedLeaves = new ArrayList<>(leaves);

     // Sort by maxDoc, descending:
-    Collections.sort(
-        sortedLeaves, Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));
+    sortedLeaves.sort(Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));

-    final List<List<LeafReaderContext>> groupedLeaves = new ArrayList<>();
-    long docSum = 0;
-    List<LeafReaderContext> group = null;
+    final List<List<LeafReaderContextPartition>> groupedLeafPartitions = new ArrayList<>();
+    int currentSliceNumDocs = 0;
+    List<LeafReaderContextPartition> group = null;
     for (LeafReaderContext ctx : sortedLeaves) {
       if (ctx.reader().maxDoc() > maxDocsPerSlice) {
         assert group == null;
-        groupedLeaves.add(Collections.singletonList(ctx));
+        // if the segment does not fit in a single slice, we split it in multiple partitions of
+        // equal size
+        int numSlices = Math.ceilDiv(ctx.reader().maxDoc(), maxDocsPerSlice);

Review Comment:
   My mental model of the whole slice/partition/segment/interval concept is: existing physical segments (leaves) divide the index into arbitrary sizes; existing slices (what we have today, not what this PR calls slices) group segments together.
   Partitions or intervals (in my view) are a logical division of the index into roughly equal-sized, contiguous (in docid space) portions, and they overlay the segments arbitrarily. It is then the job of IndexSearcher to map this logical division of work onto the underlying physical segments. The main comment here: let's not confuse ourselves by re-using the word "slice", which already means something else!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
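[Editorial aside] The `Math.ceilDiv` splitting that the second review comment anchors on can be sketched in isolation. The following is a hedged illustration, not the PR's actual code: the `Partition` record and `split` helper are invented here, and only the round-up arithmetic (`Math.ceilDiv`, Java 18+) is taken from the diff. It shows how one oversized segment can be cut into roughly equal, contiguous docid ranges.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSketch {
  /** A half-open docid interval [start, end) within a single segment (illustrative only). */
  record Partition(int start, int end) {}

  /** Splits a segment of maxDoc docs into roughly equal contiguous partitions. */
  static List<Partition> split(int maxDoc, int maxDocsPerSlice) {
    // Round up, so a segment slightly over the limit still gets an extra partition.
    int numPartitions = Math.ceilDiv(maxDoc, maxDocsPerSlice);
    // Round up again so numPartitions ranges of this size cover all of [0, maxDoc).
    int partitionSize = Math.ceilDiv(maxDoc, numPartitions);
    List<Partition> partitions = new ArrayList<>();
    for (int start = 0; start < maxDoc; start += partitionSize) {
      partitions.add(new Partition(start, Math.min(start + partitionSize, maxDoc)));
    }
    return partitions;
  }

  public static void main(String[] args) {
    // A 10-doc "segment" with at most 4 docs per slice yields 3 contiguous
    // partitions covering [0, 10) with no gaps or overlap.
    System.out.println(split(10, 4));
  }
}
```

Note how the partitions are contiguous in docid space and together cover the segment exactly once, matching the "logical division of the index" mental model described above.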