jpountz commented on code in PR #13430:
URL: https://github.com/apache/lucene/pull/13430#discussion_r1625497128


##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -522,21 +550,28 @@ private MergeSpecification doFindMerges(
         final List<SegmentCommitInfo> candidate = new ArrayList<>();

Review Comment:
   Merging is only necessary if there are more segments than necessary or more 
deletes than necessary. So this condition just stops looking for more merges 
when none of these two conditions are met?



##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -522,21 +550,28 @@ private MergeSpecification doFindMerges(
         final List<SegmentCommitInfo> candidate = new ArrayList<>();
         boolean hitTooLarge = false;
         long bytesThisMerge = 0;
+        long docCountThisMerge = 0;
         for (int idx = startIdx;
             idx < sortedEligible.size()
                 && candidate.size() < mergeFactor
-                && bytesThisMerge < maxMergedSegmentBytes;
+                && bytesThisMerge < maxMergedSegmentBytes
+                && docCountThisMerge < allowedDocCount;
             idx++) {
           final SegmentSizeAndDocs segSizeDocs = sortedEligible.get(idx);
           final long segBytes = segSizeDocs.sizeInBytes;
-
+          int segDocCount = segSizeDocs.maxDoc - segSizeDocs.delCount;
+          if (docCountThisMerge + segDocCount > allowedDocCount) {
+            // We don't want to merge segments that will produce more 
documents than allowedDocCount
+            continue;

Review Comment:
   I think we should `break` here, not `continue`. `continue` allows producing 
merges of segments whose sizes are not adjacent, I don't think we should allow 
this for the doc count condition, as this potentially makes merging run in 
quadratic time.



##########
lucene/core/src/test/org/apache/lucene/index/TestTieredMergePolicy.java:
##########
@@ -77,20 +77,23 @@ protected void assertSegmentInfos(MergePolicy policy, 
SegmentInfos infos) throws
     long levelSizeBytes = Math.max(minSegmentBytes, (long) 
(tmp.getFloorSegmentMB() * 1024 * 1024));
     long bytesLeft = totalBytes;
     double allowedSegCount = 0;
+    final int maxNumSegmentsOnHighestTier =
+        (int) Math.max(tmp.getSegmentsPerTier(), 
tmp.getTargetSearchConcurrency());
     // below we make the assumption that segments that reached the max segment
     // size divided by 2 don't need merging anymore
     int mergeFactor = (int) Math.min(tmp.getSegmentsPerTier(), 
tmp.getMaxMergeAtOnce());
     while (true) {
       final double segCountLevel = bytesLeft / (double) levelSizeBytes;
-      if (segCountLevel < tmp.getSegmentsPerTier() || levelSizeBytes >= 
maxMergedSegmentBytes / 2) {
+      if (segCountLevel < maxNumSegmentsOnHighestTier

Review Comment:
   Should it allow `maxNumSegmentsOnHighestTier` for consistency with the code 
under `main`? Though I don't expect it to make a big difference since 
`segCountLevel` is a double.
   
   ```suggestion
         if (segCountLevel <= maxNumSegmentsOnHighestTier
   ```



##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -522,21 +550,28 @@ private MergeSpecification doFindMerges(
         final List<SegmentCommitInfo> candidate = new ArrayList<>();
         boolean hitTooLarge = false;
         long bytesThisMerge = 0;
+        long docCountThisMerge = 0;
         for (int idx = startIdx;
             idx < sortedEligible.size()
                 && candidate.size() < mergeFactor
-                && bytesThisMerge < maxMergedSegmentBytes;
+                && bytesThisMerge < maxMergedSegmentBytes
+                && docCountThisMerge < allowedDocCount;
             idx++) {
           final SegmentSizeAndDocs segSizeDocs = sortedEligible.get(idx);
           final long segBytes = segSizeDocs.sizeInBytes;
-
+          int segDocCount = segSizeDocs.maxDoc - segSizeDocs.delCount;
+          if (docCountThisMerge + segDocCount > allowedDocCount) {
+            // We don't want to merge segments that will produce more 
documents than allowedDocCount
+            continue;

Review Comment:
   We probably also should produce a singleton merge here in case 
`candidate.isEmpty()` (see what we're doing for too large segments below). Most 
likely this suggested merge will not be selected because it will have a low 
score, though it could end up selected if it reclaims lots of deletes and we 
don't have better merges.



##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -916,14 +951,13 @@ public MergeSpecification 
findForcedDeletesMerges(SegmentInfos infos, MergeConte
     final Set<SegmentCommitInfo> merging = mergeContext.getMergingSegments();
 
     boolean haveWork = false;
+    int totalDelCount = 0;
     for (SegmentCommitInfo info : infos) {
       int delCount = mergeContext.numDeletesToMerge(info);
       assert assertDelCount(delCount, info);
+      totalDelCount += delCount;
       double pctDeletes = 100. * ((double) delCount) / info.info.maxDoc();
-      if (pctDeletes > forceMergeDeletesPctAllowed && !merging.contains(info)) 
{
-        haveWork = true;
-        break;
-      }
+      haveWork = haveWork || (pctDeletes > forceMergeDeletesPctAllowed && 
!merging.contains(info));

Review Comment:
   Why don't we exit the loop anymore when `haveWork` is true?



##########
lucene/core/src/test/org/apache/lucene/index/TestTieredMergePolicy.java:
##########
@@ -111,11 +114,29 @@ protected void assertSegmentInfos(MergePolicy policy, 
SegmentInfos infos) throws
       }
     }
 
+    // There can be more segments if we can't merge docs - they are balanced 
between segments
+    int maxDocsPerSegment = tmp.getMaxAllowedDocs(infos.totalMaxDoc());
+    List<Integer> segmentDocs =
+        infos.asList().stream()
+            .map(info -> info.info.maxDoc() - info.getDelCount())
+            .sorted()
+            .toList();
+    int numEligible = 0;
+    int currentSum = 0;
+    for (int i = 0; i < segmentDocs.size(); i++) {
+      currentSum += segmentDocs.get(i);
+      if (currentSum > maxDocsPerSegment) {
+        break;
+      }
+      numEligible++;
+    }
+    boolean eligibleDocsMerge = numEligible > 1;

Review Comment:
   I wonder if the condition can be made simpler. Would it be ok to check the 
sum of the doc counts of the two smallest segments? If it's greater than the 
allowed doc count, then no merging is allowed, otherwise this 2-segments merge 
would have been a legal merge?



##########
lucene/core/src/test/org/apache/lucene/index/TestTieredMergePolicy.java:
##########
@@ -111,11 +114,29 @@ protected void assertSegmentInfos(MergePolicy policy, 
SegmentInfos infos) throws
       }
     }
 
+    // There can be more segments if we can't merge docs - they are balanced 
between segments
+    int maxDocsPerSegment = tmp.getMaxAllowedDocs(infos.totalMaxDoc());

Review Comment:
   should we subtract deletions to totalMaxDoc?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to