mikemccand commented on issue #14004: URL: https://github.com/apache/lucene/issues/14004#issuecomment-2494470229
> Interestingly, an index that is less than 1GB can still have 10 segments with the above merge policy because of the constraint to not run merges where the resulting segment is less than 50% bigger than the biggest input segment. E.g. consider the following segment sizes: 100kB, 300kB, 800kB, 2MB, 5MB, 12MB, 30MB, 70MB, 150MB, 400MB. There is no pair of segments where the sum is more than 50% bigger than the max input segment.

Hmm, shouldn't the `floorSegmentSize=512MB` mean all of these segments are considered the "same size" (level) and all mergeable? Or does that 50% check trump the flooring?

> For instance, `TieredMergePolicy` automatically takes the min of `maxMergeAtOnce` and `numSegsPerTier` as a merge factor, but it's not clear to me why this is important. If the merge policy allowed merges to have between 2 and 10 segments in the above example, it could find merges in the described segment structure, and this would likely help have a lower write amplification for the same segment count?

I think it did this in order to aim for an index geometry over time that has logarithmically sized levels (ish), where segment sizes tend to cluster into these close-ish levels? It mimics the behavior of the more carefully defined `LogDoc/ByteSizeMergePolicy`. But this doesn't seem intrinsic/important -- +1 to allow it to pick a range of segments to merge at once?

"Optimal" merging is hard! I wish we had the perfect merge policy that simply took as input what amortized write amplification is acceptable during indexing, and given that budget would aim to maximize search performance (by some approximate measure)... For apps using NRT segment replication, where one JVM indexes and N JVMs search (physically/virtually different instances) the efficiently, incrementally replicated index, very high write amplification may be tolerable (Amazon Product Search falls into this case).
But for other apps that are indexing and searching on the same node, there's likely much less tolerance for burning so much CPU/IO to eke out a slightly more efficient index for searching ...
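
As a quick sanity check of the quoted segment-size example, here is a minimal Python sketch of the arithmetic. It is not Lucene code: the helper name `qualifies`, the strict "more than 1.5x the largest input" reading of the 50% constraint, and the 1 MB = 1000 kB conversion are all assumptions made for illustration, and flooring via `floorSegmentSize` is deliberately ignored.

```python
from itertools import combinations

# Segment sizes from the quoted example, in kB (assuming 1 MB = 1000 kB).
SIZES_KB = [100, 300, 800, 2_000, 5_000, 12_000, 30_000, 70_000, 150_000, 400_000]


def qualifies(segments, min_growth=1.5):
    """Hypothetical check: the merged size must exceed min_growth times
    the largest input segment (strict inequality is an assumption)."""
    return sum(segments) > min_growth * max(segments)


# All two-segment merges that pass the check.
mergeable_pairs = [p for p in combinations(SIZES_KB, 2) if qualifies(p)]

# All merges of three size-adjacent segments that pass the check.
mergeable_triples = [
    tuple(SIZES_KB[i:i + 3])
    for i in range(len(SIZES_KB) - 2)
    if qualifies(SIZES_KB[i:i + 3])
]

print("pairs that qualify:", len(mergeable_pairs))
print("adjacent triples that qualify:", len(mergeable_triples))
```

Under these assumptions no pair qualifies, while seven of the eight adjacent triples do (only 100kB + 300kB + 800kB lands exactly on the 1.5x boundary), which is consistent with the suggestion that letting the policy pick a range of merge widths would find merges in this geometry.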