vigyasharma commented on issue #13226:
URL: https://github.com/apache/lucene/issues/13226#issuecomment-2033392249

   `TieredMergePolicy` prefers merges that have less skew across segment sizes, 
a smaller total size, and more expunged deletes. Each candidate merge is a set 
of segments that will be merged into a single segment (it eventually becomes 
a `OneMerge` object). To choose among candidates, the policy assigns a *merge 
score* to each one, and lower scores are preferred for merging.
   
   __
   
   ```text
   [2024-03-20T15:46:36,015][TRACE][o.e.i.e.E.MP ]: Lucene Merge Thread 
#403832] MP:   maybe=_1wtuc(8.7.0):C29948777/29948777:[diagnostics={os=Linux, 
java.version=11.0.17, os.arch=amd64, java.runtime.version=11.0.17+9-LTS, 
source=merge, ... :softDel=12100433 :id=i9yc9l5c6qvt26u9srmz8umo 
score=2.220052984489907 skew=0.713 nonDelRatio=1.000 tooLarge=false 
size=7083.170 MB
   ```
   
   From the log above, the merge containing `_1wtuc` has a fairly high skew 
value of 0.713 (skew ranges from `1/mergeFactor`, i.e. 0.1 for a merge factor 
of 10, which is best, to 1, which is worst), but what stands out is the high 
value of `nonDelRatio = 1.000`.
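
   To see how the three logged factors combine, here is a minimal sketch that recomputes the score from the values in the trace line. It assumes the score is `skew * totalBytesAfterMerge^0.05 * nonDelRatio^2`, which is how I recall recent `TieredMergePolicy` versions computing it, and that the logged `size` is the post-merge size; please check the source for your Lucene version. `MergeScoreSketch` and `mergeScore` are hypothetical names, not Lucene APIs.

```java
/** Hypothetical helper, not part of Lucene: recomputes a merge score from trace-line values. */
final class MergeScoreSketch {

    // Assumption: score = skew * totalBytesAfterMerge^0.05 * nonDelRatio^2.
    // Lower scores are preferred by the policy.
    static double mergeScore(double skew, double totalBytesAfterMerge, double nonDelRatio) {
        double score = skew;
        // Small exponent: merge size only mildly affects the score.
        score *= Math.pow(totalBytesAfterMerge, 0.05);
        // Squared: merges that reclaim deletes are favored much more strongly.
        score *= Math.pow(nonDelRatio, 2);
        return score;
    }

    public static void main(String[] args) {
        // size=7083.170 MB, skew=0.713, nonDelRatio=1.000 from the log line.
        double sizeBytes = 7083.170 * 1024 * 1024;
        System.out.println("as logged:         " + mergeScore(0.713, sizeBytes, 1.000));
        // Hypothetical variant: the same merge if half of its bytes were reclaimable.
        System.out.println("half reclaimable:  " + mergeScore(0.713, sizeBytes * 0.5, 0.5));
    }
}
```

   Under these assumptions the first value lands close to the logged `score=2.22...` (the small difference comes from `skew` being rounded to three decimals in the log). The second shows how much lower the score would be if the merge could actually reclaim deletes.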
   
   
   **nonDelRatio** is 
[calculated](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java#L700)
 as `totalBytesAfterMerge / totalBytesBeforeMerge`, and gives a sense of how 
many deletes the merge would expunge. The closer the value is to 1 (the 
maximum), the fewer deletes are reclaimed; a value of exactly 1 indicates that 
the merge will not reclaim any deletes!
   
   The value for `totalBytesAfterMerge` comes from summing up the post-merge 
size of each segment, which is 
[computed](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/MergePolicy.java#L753-L765)
 by prorating the segment size by the fraction of deletes that would be 
expunged: `segmentSize * (1 - reclaimableDeletes/maxDoc)`. The number of 
reclaimable deletes comes from `numDeletesToMerge()` in the merge policy, which 
implementations such as `SoftDeletesRetentionMergePolicy` can override to 
retain soft-deleted documents in the segment after the merge.
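
   As a rough illustration of the two formulas above, the sketch below prorates each segment's size and then takes the after/before ratio. It is plain Java that mirrors the arithmetic only; it does not call Lucene, and the class name, method names, and sample numbers are all made up for the example.

```java
/** Hypothetical stand-in for the proration arithmetic; not Lucene code. */
final class NonDelRatioSketch {

    // segmentSize * (1 - reclaimableDeletes / maxDoc)
    static long sizeAfterMerge(long segmentBytes, int reclaimableDeletes, int maxDoc) {
        if (maxDoc <= 0) {
            return segmentBytes; // empty segment: nothing to prorate
        }
        double delRatio = (double) reclaimableDeletes / maxDoc;
        return (long) (segmentBytes * (1.0 - delRatio));
    }

    // totalBytesAfterMerge / totalBytesBeforeMerge over all segments in the merge.
    static double nonDelRatio(long[] segmentBytes, int[] reclaimableDeletes, int[] maxDocs) {
        long before = 0;
        long after = 0;
        for (int i = 0; i < segmentBytes.length; i++) {
            before += segmentBytes[i];
            after += sizeAfterMerge(segmentBytes[i], reclaimableDeletes[i], maxDocs[i]);
        }
        return (double) after / before;
    }

    public static void main(String[] args) {
        long[] bytes = {1_000_000L, 1_000_000L};
        int[] maxDocs = {1000, 1000};

        // Half of the first segment's docs are reclaimable deletes.
        System.out.println(nonDelRatio(bytes, new int[] {500, 0}, maxDocs));

        // A retention policy that reports zero reclaimable deletes for every segment.
        System.out.println(nonDelRatio(bytes, new int[] {0, 0}, maxDocs));
    }
}
```

   The second call is the situation suspected here: if the policy reports no reclaimable deletes, the prorated size equals the original size and the ratio is 1, no matter how many deleted documents the segments actually hold.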
   
   It is likely that for this segment, even though it has a large number of 
deletes, `SoftDeletesRetentionMergePolicy` is retaining all of them, causing 
`nonDelRatio` to be 1. It would help to look at your 
**SoftDeletesRetentionMergePolicy** implementation.
   
   ...
   
   As a side note, is the log line above truncated? Going by 
`C29948777/29948777` and `:softDel=12100433`, the number of pending deletes in 
the segment is `29948777` (same as total docs), while the number of soft 
deletes is `12100433` (only about 40% of total pending deletes). Even if all of 
the soft deletes are retained by the merge policy, there should still be about 
60% of deletes that the merge can reclaim. I wonder if some info, like details 
of other segments in the merge, got truncated from the log line.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

