[GitHub] [lucene] mikemccand commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.

GitBox Wed, 18 May 2022 05:43:02 -0700


mikemccand commented on code in PR #900:
URL: https://github.com/apache/lucene/pull/900#discussion_r875839832



##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -532,13 +532,21 @@ private MergeSpecification doFindMerges(
         // segments, and already pre-excluded the too-large segments:
         assert candidate.size() > 0;
 
+        SegmentSizeAndDocs maxSegmentSize = 
segInfosSizes.get(candidate.get(0));

Review Comment:
   The incoming (sorted) infos are sorted by decreasing size, right? So the 
`candidate.get(0)` is indeed the max.
   
   Maybe rename to `maxCadidateSegmentSize`?



##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -532,13 +532,21 @@ private MergeSpecification doFindMerges(
         // segments, and already pre-excluded the too-large segments:
         assert candidate.size() > 0;
 
+        SegmentSizeAndDocs maxSegmentSize = 
segInfosSizes.get(candidate.get(0));
+        if (hitTooLarge == false
+            && mergeType == MERGE_TYPE.NATURAL
+            && bytesThisMerge < maxSegmentSize.sizeInBytes * 1.5) {
+          // Ignore any merge where the resulting segment is not at least 50% 
larger than the

Review Comment:
   Hmm so this new logic applies to all merges, not just the "under floor" 
ones?  I wonder if there is some risk here that this change will block 
"pathological" merges that we intentionally do today under heavy deletions 
count cases?  Maybe we should pro-rate by deletion percent?  Oh!  I think 
`SegmentSizeAndDocs` already does so (well the `size` method in `MergePolicy`). 
 Maybe we should add a comment / javadoc in this confusing class heh.



##########
lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java:
##########
@@ -582,23 +589,29 @@ public MergeSpecification findMerges(
         if (anyMerging) {
           // skip
         } else if (!anyTooLarge) {
-          if (spec == null) spec = new MergeSpecification();
-          final List<SegmentCommitInfo> mergeInfos = new ArrayList<>(end - 
start);
-          for (int i = start; i < end; i++) {
-            mergeInfos.add(levels.get(i).info);
-            assert infos.contains(levels.get(i).info);
-          }
-          if (verbose(mergeContext)) {
-            message(
-                "  add merge="
-                    + segString(mergeContext, mergeInfos)
-                    + " start="
-                    + start
-                    + " end="
-                    + end,
-                mergeContext);
-          }
-          spec.add(new OneMerge(mergeInfos));
+          if (mergeSize >= maxSegmentSize * 1.5) {
+            // Ignore any merge where the resulting segment is not at least 
50% larger than the
+            // biggest input segment.
+            // Otherwise we could run into pathological O(N^2) merging where 
merges keep rewriting
+            // again and again the biggest input segment into a segment that 
is barely bigger.
+            if (spec == null) spec = new MergeSpecification();

Review Comment:
   Hmm, split this into multiple lines with { and }?  Does spotless/tidy do 
that?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] mikemccand commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.

Reply via email to