jpountz commented on code in PR #13430: URL: https://github.com/apache/lucene/pull/13430#discussion_r1625497128
########## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ########## @@ -522,21 +550,28 @@ private MergeSpecification doFindMerges( final List<SegmentCommitInfo> candidate = new ArrayList<>(); Review Comment: Merging is only necessary if there are more segments than necessary or more deletes than necessary. So this condition just stops looking for more merges when none of these two conditions are met? ########## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ########## @@ -522,21 +550,28 @@ private MergeSpecification doFindMerges( final List<SegmentCommitInfo> candidate = new ArrayList<>(); boolean hitTooLarge = false; long bytesThisMerge = 0; + long docCountThisMerge = 0; for (int idx = startIdx; idx < sortedEligible.size() && candidate.size() < mergeFactor - && bytesThisMerge < maxMergedSegmentBytes; + && bytesThisMerge < maxMergedSegmentBytes + && docCountThisMerge < allowedDocCount; idx++) { final SegmentSizeAndDocs segSizeDocs = sortedEligible.get(idx); final long segBytes = segSizeDocs.sizeInBytes; - + int segDocCount = segSizeDocs.maxDoc - segSizeDocs.delCount; + if (docCountThisMerge + segDocCount > allowedDocCount) { + // We don't want to merge segments that will produce more documents than allowedDocCount + continue; Review Comment: I think we should `break` here, not `continue`. `continue` allows producing merges of segments whose sizes are not adjacent, I don't think we should allow this for the doc count condition, as this potentially makes merging run in quadratic time. ########## lucene/core/src/test/org/apache/lucene/index/TestTieredMergePolicy.java: ########## @@ -77,20 +77,23 @@ protected void assertSegmentInfos(MergePolicy policy, SegmentInfos infos) throws long levelSizeBytes = Math.max(minSegmentBytes, (long) (tmp.getFloorSegmentMB() * 1024 * 1024)); long bytesLeft = totalBytes; double allowedSegCount = 0; + final int maxNumSegmentsOnHighestTier = + (int) Math.max(tmp.getSegmentsPerTier(), tmp.getTargetSearchConcurrency()); // below we make the assumption that segments that reached the max segment // size divided by 2 don't need merging anymore int mergeFactor = (int) Math.min(tmp.getSegmentsPerTier(), tmp.getMaxMergeAtOnce()); while (true) { final double segCountLevel = bytesLeft / (double) levelSizeBytes; - if (segCountLevel < tmp.getSegmentsPerTier() || levelSizeBytes >= maxMergedSegmentBytes / 2) { + if (segCountLevel < maxNumSegmentsOnHighestTier Review Comment: Should it allow `maxNumSegmentsOnHighestTier` for consistency with the code under `main`? Though I don't expect it to make a big difference since `segCountLevel` is a double. ```suggestion if (segCountLevel <= maxNumSegmentsOnHighestTier ``` ########## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ########## @@ -522,21 +550,28 @@ private MergeSpecification doFindMerges( final List<SegmentCommitInfo> candidate = new ArrayList<>(); boolean hitTooLarge = false; long bytesThisMerge = 0; + long docCountThisMerge = 0; for (int idx = startIdx; idx < sortedEligible.size() && candidate.size() < mergeFactor - && bytesThisMerge < maxMergedSegmentBytes; + && bytesThisMerge < maxMergedSegmentBytes + && docCountThisMerge < allowedDocCount; idx++) { final SegmentSizeAndDocs segSizeDocs = sortedEligible.get(idx); final long segBytes = segSizeDocs.sizeInBytes; - + int segDocCount = segSizeDocs.maxDoc - segSizeDocs.delCount; + if (docCountThisMerge + segDocCount > allowedDocCount) { + // We don't want to merge segments that will produce more documents than allowedDocCount + continue; Review Comment: We probably also should produce a singleton merge here in case `candidate.isEmpty()` (see what we're doing for too large segments below). Most likely this suggested merge will not be selected because it will have a low score, though it could end up selected if it reclaims lots of deletes and we don't have better merges. ########## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ########## @@ -916,14 +951,13 @@ public MergeSpecification findForcedDeletesMerges(SegmentInfos infos, MergeConte final Set<SegmentCommitInfo> merging = mergeContext.getMergingSegments(); boolean haveWork = false; + int totalDelCount = 0; for (SegmentCommitInfo info : infos) { int delCount = mergeContext.numDeletesToMerge(info); assert assertDelCount(delCount, info); + totalDelCount += delCount; double pctDeletes = 100. * ((double) delCount) / info.info.maxDoc(); - if (pctDeletes > forceMergeDeletesPctAllowed && !merging.contains(info)) { - haveWork = true; - break; - } + haveWork = haveWork || (pctDeletes > forceMergeDeletesPctAllowed && !merging.contains(info)); Review Comment: Why don't we exit the loop anymore when `haveWork` is true? ########## lucene/core/src/test/org/apache/lucene/index/TestTieredMergePolicy.java: ########## @@ -111,11 +114,29 @@ protected void assertSegmentInfos(MergePolicy policy, SegmentInfos infos) throws } } + // There can be more segments if we can't merge docs - they are balanced between segments + int maxDocsPerSegment = tmp.getMaxAllowedDocs(infos.totalMaxDoc()); + List<Integer> segmentDocs = + infos.asList().stream() + .map(info -> info.info.maxDoc() - info.getDelCount()) + .sorted() + .toList(); + int numEligible = 0; + int currentSum = 0; + for (int i = 0; i < segmentDocs.size(); i++) { + currentSum += segmentDocs.get(i); + if (currentSum > maxDocsPerSegment) { + break; + } + numEligible++; + } + boolean eligibleDocsMerge = numEligible > 1; Review Comment: I wonder if the condition can be made simpler. Would it be ok to check the sum of the doc counts of the two smallest segments? If it's greater than the allowed doc count, then no merging is allowed, otherwise this 2-segments merge would have been a legal merge? ########## lucene/core/src/test/org/apache/lucene/index/TestTieredMergePolicy.java: ########## @@ -111,11 +114,29 @@ protected void assertSegmentInfos(MergePolicy policy, SegmentInfos infos) throws } } + // There can be more segments if we can't merge docs - they are balanced between segments + int maxDocsPerSegment = tmp.getMaxAllowedDocs(infos.totalMaxDoc()); Review Comment: should we subtract deletions to totalMaxDoc? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org