cheng66551 commented on PR #14163: URL: https://github.com/apache/lucene/pull/14163#issuecomment-2635519801
> It's terrible that `TieredMergePolicy` was not merging these segments, naturally or under `forceMerge` -- let's understand why it's failing to do so? It's like we need an `explain` API for its merge selection.
>
> `TMP` does have a `setForceMergeDeletesPctAllowed`, which defaults to 10%, meaning if a segment has <= 10% deletions, it won't be selected under `forceMerge`. But if I'm reading it right you have a segment `_1btbuk` with ~82.4% deleted docs (`12507939 / (2666453 + 12507939) = 0.8242794175872088`), which should have been selected.
>
> Have you changed `setMaxMergedSegmentMB` away from its default (5 GB)?
>
> Separately, you have crazy high segment names -- I'm curious if this is a very long lived index?
>
> This PR reminds me of the Linux "direct IO" struggles. Linus [really does not like the existence of "direct IO" (`O_DIRECT` flag to `open` API)](https://www.theregister.com/2019/06/21/linus_torvalds_rant/), because its existence means users may jump straight to that and take pressure off improving how Linux manages IO caching (the buffer cache). I.e. rather than improving the kernel's IO caching, users can skip it altogether. It's the same thing here: if we expose a merge policy where users can simply pick their own merges, we take pressure off of fixing the problems in our default `TieredMergePolicy`. That being said, `MergePolicy` is pluggable for exactly this reason: users (well direct Lucene users) are free to customize merge selection.

@mikemccand

1. Both `setMaxMergedSegmentMB` and `setForceMergeDeletesPctAllowed` are at their default values; neither has been modified.
2. Yes, this is a long-lived index.
3. In Elasticsearch, the `TieredMergePolicy` is wrapped with `SoftDeletesRetentionMergePolicy`. I suspect that a large number of soft deletes causes the proportion of deleted documents seen by the merge policy to fall below 10%, but I have no evidence to support this.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
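
For reference, the deleted-document ratio quoted in the comment can be checked with a few lines of plain Java. This is only a sketch of the arithmetic, not Lucene's actual selection code: the class and method names (`DeletePctCheck`, `deletedPct`, `exceedsAllowed`) are hypothetical, the doc counts for segment `_1btbuk` are taken from the comment, and the 10% threshold is the default the comment attributes to `setForceMergeDeletesPctAllowed`.

```java
// Hypothetical helper class; it does not use or mirror any Lucene API.
class DeletePctCheck {

    // Percentage of a segment's documents that are deleted,
    // given the live and deleted doc counts.
    static double deletedPct(long liveDocs, long deletedDocs) {
        return 100.0 * deletedDocs / (liveDocs + deletedDocs);
    }

    // True when the deleted percentage is strictly above the allowed threshold.
    // Assumption: a segment at or below the threshold is not selected.
    static boolean exceedsAllowed(double pct, double allowedPct) {
        return pct > allowedPct;
    }

    public static void main(String[] args) {
        // Counts quoted in the comment for segment _1btbuk.
        long liveDocs = 2_666_453L;
        long deletedDocs = 12_507_939L;

        double pct = deletedPct(liveDocs, deletedDocs);
        System.out.printf("deleted: %.2f%%, above 10%% threshold: %b%n",
                pct, exceedsAllowed(pct, 10.0));
    }
}
```

If the soft-delete hypothesis in point 3 holds, the counts the merge policy actually sees may differ from the ones plugged in here, which is why the inputs are kept as explicit parameters.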