Adrien Grand created LUCENE-10599:
-------------------------------------

             Summary: Improve LogMergePolicy's handling of maxMergeSize
                 Key: LUCENE-10599
                 URL: https://issues.apache.org/jira/browse/LUCENE-10599
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Adrien Grand


LogMergePolicy excludes segments from merging when their size is greater than
or equal to maxMergeSize. Since a segment whose size is maxMergeSize-1 is still
considered for merging, segments will effectively reach a size somewhere
between maxMergeSize and mergeFactor*maxMergeSize before they stop being
considered for merging.

At least this is what I thought. When LogMergePolicy ignores a segment that is 
too large for merging, it also ignores other segments that are in the same 
window of mergeFactor segments for merging if they are on the same tier. So 
actually segments might reach a size that is somewhere between maxMergeSize / 
mergeFactor^0.75 and maxMergeSize * mergeFactor before they are not considered 
for merging anymore.

Assuming a merge factor of 10 and a max merge size of 1,000, this means that
segments will reach their maximum size somewhere between 178 and 10,000. This
range is too large and makes maxMergeSize too hard to reason about.
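
For reference, the two bounds above can be checked with a few lines of Java.
This is only a sketch of the arithmetic stated in this issue, not code from
LogMergePolicy; the class and method names (MaxMergeSizeBounds, lowerBound,
upperBound) are made up for illustration.

```java
public class MaxMergeSizeBounds {

    // Lower bound as stated in the issue: maxMergeSize / mergeFactor^0.75.
    public static double lowerBound(double maxMergeSize, int mergeFactor) {
        return maxMergeSize / Math.pow(mergeFactor, 0.75);
    }

    // Upper bound as stated in the issue: maxMergeSize * mergeFactor.
    public static double upperBound(double maxMergeSize, int mergeFactor) {
        return maxMergeSize * mergeFactor;
    }

    public static void main(String[] args) {
        double maxMergeSize = 1000; // example value from the issue
        int mergeFactor = 10;       // example value from the issue
        System.out.printf("lower bound: %.0f%n", lowerBound(maxMergeSize, mergeFactor));
        System.out.printf("upper bound: %.0f%n", upperBound(maxMergeSize, mergeFactor));
    }
}
```

With mergeFactor=10 and maxMergeSize=1,000 the lower bound rounds to 178 and
the upper bound is 10,000, matching the numbers quoted above.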

Specifically, if you have 10 segments of 999 docs each, then LogDocMergePolicy
will happily merge them into a single 9,990-docs segment. However, if you have
one 1,000-docs segment and 9 180-docs segments, then the 180-docs segments will
not get merged with any other segment, even if you keep adding segments to the
index.

I propose to change this behavior so that when a large segment is encountered, 
then we wouldn't skip the entire window of mergeFactor segments, but just the 
segments that are too large.
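
To make the difference concrete, here is a deliberately simplified model of
the two behaviors. It is an assumption-laden sketch, not the actual
LogMergePolicy code: it treats a window as a plain list of doc counts, ignores
tiers/levels entirely, and the names (WindowSelectionSketch, skipWholeWindow,
skipOnlyLarge) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class WindowSelectionSketch {

    // Simplified model of the current behavior described above: one
    // too-large segment excludes the whole window from merging.
    public static List<Integer> skipWholeWindow(List<Integer> window, int maxMergeSize) {
        for (int size : window) {
            if (size >= maxMergeSize) {
                return Collections.emptyList();
            }
        }
        return new ArrayList<>(window);
    }

    // Simplified model of the proposal: only the too-large segments are
    // skipped, the rest of the window stays eligible for merging.
    public static List<Integer> skipOnlyLarge(List<Integer> window, int maxMergeSize) {
        List<Integer> eligible = new ArrayList<>();
        for (int size : window) {
            if (size < maxMergeSize) {
                eligible.add(size);
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        // One 1,000-docs segment followed by nine 180-docs segments.
        List<Integer> window = new ArrayList<>();
        window.add(1000);
        for (int i = 0; i < 9; i++) {
            window.add(180);
        }
        System.out.println("whole window skipped: " + skipWholeWindow(window, 1000));
        System.out.println("only large skipped:   " + skipOnlyLarge(window, 1000));
    }
}
```

In this model, the window from the example above yields no merge candidates
under the first method, while the second still offers the nine 180-docs
segments for merging.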



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
