jpountz commented on code in PR #13430: URL: https://github.com/apache/lucene/pull/13430#discussion_r1619456981
########## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ########## @@ -93,6 +93,7 @@ public class TieredMergePolicy extends MergePolicy { private double segsPerTier = 10.0; private double forceMergeDeletesPctAllowed = 10.0; private double deletesPctAllowed = 20.0; + private int targetSearchConcurrency = -1; Review Comment: Should it default to 1? I expect it to yield the same behavior as today? ########## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ########## @@ -581,11 +621,14 @@ private MergeSpecification doFindMerges( // whose length is less than the merge factor, it means we are reaching // the tail of the list of segments and will only find smaller merges. // Stop here. - if (bestScore != null && hitTooLarge == false && candidate.size() < mergeFactor) { + if (bestScore != null + && hitTooLarge == false + && hitMaxDocs == false + && candidate.size() < mergeFactor) { break; } - final MergeScore score = score(candidate, hitTooLarge, segInfosSizes); + final MergeScore score = score(candidate, hitTooLarge || hitMaxDocs, segInfosSizes); Review Comment: I don't think that we want to do this. `hitTooLarge` is considered a good thing by this merge policy. It means that we found a merge that reached the maximum merged segment size. It's great because if we perform this merge, then this segment will not be eligible again for merging (unless it gets many deletes), so we'll essentially be done with it. So these merges are given a very good score by assuming that they have a perfect skew. `hitMaxDocs` is different, we don't want to prioritize unbalanced merges that produce segments of `hitMaxDocs` documents? It's better to keep prioritizing balanced small segment merges as usual? My expectation for handling `allowedMaxDoc` is that we would just never score any merge that has more than `allowedMaxDoc` documents and force the merge policy to select one of the candidate merges that produce fewer documents than that. ########## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ########## @@ -257,6 +258,25 @@ public double getSegmentsPerTier() { return segsPerTier; } + /** + * Sets the target search concurrency. This allows merging to ensure that there are at least + * targetSearchConcurrency segments on the top tier. This setting can be overriden by force Review Comment: Let's rather say that this prevents creating segments that are bigger than maxDoc/targetSearchConcurrency, which in-turn makes the work parallelizable into `targetSearchConcurrency` slices of similar doc counts? I think it's also worth clarifying that this makes merging less aggressive, ie. high values of `targetSearchConcurrency` result in indexes that do less merging and have more segments. ########## lucene/core/src/test/org/apache/lucene/index/TestTieredMergePolicy.java: ########## @@ -431,6 +447,128 @@ public void testForcedMergesRespectSegSize() throws Exception { dir.close(); } + public void testForcedMergesRespectsTargetSearchConcurrency() throws Exception { Review Comment: You may want to check out `BaseMergePolicyTestCase#testSimulateAppendOnly`. It simulates merging without actually indexing docs, which allows simulating many many rounds of merging in very little time, while comparing how some parameters (like `targetSearchConcurrency`) affect merging decisions and metrics such as write amplification through merges. ########## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ########## @@ -409,6 +433,8 @@ public MergeSpecification findMerges( // allowedSegCount may occasionally be less than segsPerTier // if segment sizes are below the floor size allowedSegCount = Math.max(allowedSegCount, segsPerTier); + // Override with the targetSearchConcurrency if it is greater + allowedSegCount = Math.max(allowedSegCount, targetSearchConcurrency); Review Comment: I would also do a change like this one a few lines above: ```patch diff --git a/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java b/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java index 208aa297287..4126601e51e 100644 --- a/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java +++ b/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java @@ -392,13 +392,14 @@ public class TieredMergePolicy extends MergePolicy { allowedDelCount = Math.max(0, allowedDelCount); final int mergeFactor = (int) Math.min(maxMergeAtOnce, segsPerTier); + final int maxNumSegmentsOnHighestTier = Math.max(segsPerTier, targetSearchConcurrency); // Compute max allowed segments in the index long levelSize = Math.max(minSegmentBytes, floorSegmentBytes); long bytesLeft = totIndexBytes; double allowedSegCount = 0; while (true) { final double segCountLevel = bytesLeft / (double) levelSize; - if (segCountLevel < segsPerTier || levelSize == maxMergedSegmentBytes) { + if (segCountLevel <= maxNumSegmentsOnHighestTier || levelSize == maxMergedSegmentBytes) { allowedSegCount += Math.ceil(segCountLevel); break; } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org