cheng66551 commented on PR #14163:
URL: https://github.com/apache/lucene/pull/14163#issuecomment-2635519801

   > It's terrible that `TieredMergePolicy` was not merging these segments, 
naturally or under `forceMerge` -- let's understand why it's failing to do so? 
It's like we need an `explain` API for its merge selection.
   > 
   > `TMP` does have a `setForceMergeDeletesPctAllowed`, which defaults to 10%, 
meaning if a segment has <= 10% deletions, it won't be selected under 
`forceMerge`. But if I'm reading it right you have a segment `_1btbuk` with 
~82.4% deleted docs (`12507939 / (2666453 + 12507939) = 0.8242794175872088`), 
which should have been selected.
   > 
   > Have you changed `setMaxMergedSegmentMB` away from its default (5 GB)?
   > 
   > Separately, you have crazy high segment names -- I'm curious if this is a 
very long lived index?
   > 
   > This PR reminds me of the Linux "direct IO" struggles. Linus [really does 
not like the existence of "direct IO" (`O_DIRECT` flag to `open` 
API)](https://www.theregister.com/2019/06/21/linus_torvalds_rant/), because its 
existence means users may jump straight to that and take pressure off improving 
how Linux manages IO caching (the buffer cache). I.e. rather than improving the 
kernel's IO caching, users can skip it altogether. It's the same thing here: if 
we expose a merge policy where users can simply pick their own merges, we take 
pressure off of fixing the problems in our default `TieredMergePolicy`. That 
being said, `MergePolicy` is pluggable for exactly this reason: users (well 
direct Lucene users) are free to customize merge selection.
   
   @mikemccand 
   
   1. Both `setMaxMergedSegmentMB` and `setForceMergeDeletesPctAllowed` are at 
their default values; neither has been modified.
   2. Yes, this is a long-lived index.
   3. In Elasticsearch, `TieredMergePolicy` is wrapped with 
`SoftDeletesRetentionMergePolicy`. I suspect that a large number of soft 
deletes pushes the proportion of documents counted as deleted below 10%, but 
I have no evidence to support this.
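
To make the suspicion in point 3 concrete, here is a rough, Lucene-free sketch of the arithmetic. It reuses the live and deleted counts quoted above for `_1btbuk`. Everything else is an assumption for illustration: the class and method names are hypothetical, the retained soft-delete count is invented, and the idea that retained soft deletes are subtracted before the 10% check is exactly the unverified guess, not confirmed behavior.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene or Elasticsearch.
class DeletePctSketch {

    // Percentage of a segment's documents that are deleted.
    static double deletePct(long liveDocs, long deletedDocs) {
        long maxDoc = liveDocs + deletedDocs;
        return maxDoc == 0 ? 0.0 : 100.0 * deletedDocs / maxDoc;
    }

    // ASSUMPTION: soft-deleted docs that must be retained are not counted
    // as reclaimable deletes when the percentage is evaluated.
    static double effectiveDeletePct(long liveDocs, long deletedDocs, long retainedSoftDeletes) {
        long maxDoc = liveDocs + deletedDocs;
        long reclaimable = Math.max(0L, deletedDocs - retainedSoftDeletes);
        return maxDoc == 0 ? 0.0 : 100.0 * reclaimable / maxDoc;
    }

    public static void main(String[] args) {
        long live = 2_666_453L;       // live docs quoted for _1btbuk
        long deleted = 12_507_939L;   // deleted docs quoted for _1btbuk
        long retained = 11_000_000L;  // invented value, for illustration only
        double allowedPct = 10.0;     // default forceMergeDeletesPctAllowed per the quote

        double raw = deletePct(live, deleted);
        double effective = effectiveDeletePct(live, deleted, retained);

        System.out.printf(Locale.ROOT, "raw deletes: %.1f%%, selected: %b%n",
                raw, raw > allowedPct);
        System.out.printf(Locale.ROOT, "after retained soft deletes: %.1f%%, selected: %b%n",
                effective, effective > allowedPct);
    }
}
```

If something like this is happening, the segment would look heavily deleted from its raw counts yet still fall under the threshold, which would match what we observe. Checking the actual soft-delete and retention numbers for the segment would confirm or rule this out.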
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

