Re: [PR] WIP - Add minimum number of segments to TieredMergePolicy [lucene]

via GitHub Thu, 30 May 2024 01:25:44 -0700


jpountz commented on code in PR #13430:
URL: https://github.com/apache/lucene/pull/13430#discussion_r1619456981



##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -93,6 +93,7 @@ public class TieredMergePolicy extends MergePolicy {
   private double segsPerTier = 10.0;
   private double forceMergeDeletesPctAllowed = 10.0;
   private double deletesPctAllowed = 20.0;
+  private int targetSearchConcurrency = -1;

Review Comment:
   Should it default to 1? I expect it to yield the same behavior as today?



##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -581,11 +621,14 @@ private MergeSpecification doFindMerges(
         // whose length is less than the merge factor, it means we are reaching
         // the tail of the list of segments and will only find smaller merges.
         // Stop here.
-        if (bestScore != null && hitTooLarge == false && candidate.size() < 
mergeFactor) {
+        if (bestScore != null
+            && hitTooLarge == false
+            && hitMaxDocs == false
+            && candidate.size() < mergeFactor) {
           break;
         }
 
-        final MergeScore score = score(candidate, hitTooLarge, segInfosSizes);
+        final MergeScore score = score(candidate, hitTooLarge || hitMaxDocs, 
segInfosSizes);

Review Comment:
   I don't think that we want to do this. `hitTooLarge` is considered a good 
thing by this merge policy. It means that we found a merge that reached the 
maximum merged segment size. It's great because if we perform this merge, then 
this segment will not be eligible again for merging (unless it gets many 
deletes), so we'll essentially be done with it. So these merges are given a 
very good score by assuming that they have a perfect skew.
   
   `hitMaxDocs` is different, we don't want to prioritize unbalanced merges 
that produce segments of `hitMaxDocs` documents? It's better to keep 
prioritizing balanced small segment merges as usual?
   
   My expectation for handling `allowedMaxDoc` is that we would just never 
score any merge that has more than `allowedMaxDoc` documents and force the 
merge policy to select one of the candidate merges that produce fewer documents 
than that.



##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -257,6 +258,25 @@ public double getSegmentsPerTier() {
     return segsPerTier;
   }
 
+  /**
+   * Sets the target search concurrency. This allows merging to ensure that 
there are at least
+   * targetSearchConcurrency segments on the top tier. This setting can be 
overriden by force

Review Comment:
   Let's rather say that this prevents creating segments that are bigger than 
maxDoc/targetSearchConcurrency, which in-turn makes the work parallelizable 
into `targetSearchConcurrency` slices of similar doc counts?
   
   I think it's also worth clarifying that this makes merging less aggressive, 
ie. high values of `targetSearchConcurrency` result in indexes that do less 
merging and have more segments.



##########
lucene/core/src/test/org/apache/lucene/index/TestTieredMergePolicy.java:
##########
@@ -431,6 +447,128 @@ public void testForcedMergesRespectSegSize() throws 
Exception {
     dir.close();
   }
 
+  public void testForcedMergesRespectsTargetSearchConcurrency() throws 
Exception {

Review Comment:
   You may want to check out `BaseMergePolicyTestCase#testSimulateAppendOnly`. 
It simulates merging without actually indexing docs, which allows simulating 
many many rounds of merging in very little time, while comparing how some 
parameters (like `targetSearchConcurrency`) affect merging decisions and 
metrics such as write amplification through merges.



##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -409,6 +433,8 @@ public MergeSpecification findMerges(
     // allowedSegCount may occasionally be less than segsPerTier
     // if segment sizes are below the floor size
     allowedSegCount = Math.max(allowedSegCount, segsPerTier);
+    // Override with the targetSearchConcurrency if it is greater
+    allowedSegCount = Math.max(allowedSegCount, targetSearchConcurrency);

Review Comment:
   I would also do a change like this one a few lines above:
   
   ```patch
   diff --git 
a/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java 
b/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java
   index 208aa297287..4126601e51e 100644
   --- a/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java
   +++ b/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java
   @@ -392,13 +392,14 @@ public class TieredMergePolicy extends MergePolicy {
        allowedDelCount = Math.max(0, allowedDelCount);
    
        final int mergeFactor = (int) Math.min(maxMergeAtOnce, segsPerTier);
   +    final int maxNumSegmentsOnHighestTier = Math.max(segsPerTier, 
targetSearchConcurrency);
        // Compute max allowed segments in the index
        long levelSize = Math.max(minSegmentBytes, floorSegmentBytes);
        long bytesLeft = totIndexBytes;
        double allowedSegCount = 0;
        while (true) {
          final double segCountLevel = bytesLeft / (double) levelSize;
   -      if (segCountLevel < segsPerTier || levelSize == 
maxMergedSegmentBytes) {
   +      if (segCountLevel <= maxNumSegmentsOnHighestTier || levelSize == 
maxMergedSegmentBytes) {
            allowedSegCount += Math.ceil(segCountLevel);
            break;
          }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] WIP - Add minimum number of segments to TieredMergePolicy [lucene]

Reply via email to