mikemccand commented on issue #14004:
URL: https://github.com/apache/lucene/issues/14004#issuecomment-2494470229

   > Interestingly, an index that is less than 1GB can still have 10 segments 
with the above merge policy because of the constraint to not run merges where 
the resulting segment is less than 50% bigger than the biggest input segment. 
E.g. consider the following segment sizes: 100kB, 300kB, 800kB, 2MB, 5MB, 12MB, 
30MB, 70MB, 150MB, 400MB. There is no pair of segments where the sum is more 
than 50% bigger than the max input segment.
   
   Hmm, shouldn't the `floorSegmentSize=512MB` mean all of these segments are 
considered the "same size" (level) and all mergeable?  Or does that 50% check 
trump the flooring?
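
   To make the quoted example concrete, here is a minimal sketch that checks the pair claim directly. It assumes 1MB = 1000kB, treats the merged size as the plain sum of the inputs (no deletes), and reads the constraint as "a merge is allowed if the result is at least 50% bigger than the largest input". The class and method names are made up for illustration and are not Lucene APIs. It deliberately ignores any flooring, so it says nothing about whether flooring should change the outcome.

```java
import java.util.ArrayList;
import java.util.List;

public class PairMergeCheck {
    // Segment sizes from the quoted example, in kB (assumption: 1MB = 1000kB).
    static final long[] SIZES_KB = {
        100, 300, 800, 2_000, 5_000, 12_000, 30_000, 70_000, 150_000, 400_000
    };

    // Assumed reading of the constraint: merged size >= 1.5 * largest input.
    // Written with integer math to avoid floating-point rounding.
    static boolean growsEnough(long mergedSize, long largestInput) {
        return 2 * mergedSize >= 3 * largestInput;
    }

    // Returns every pair of segments whose merge would pass the 50% check.
    static List<long[]> mergeablePairs(long[] sizes) {
        List<long[]> pairs = new ArrayList<>();
        for (int i = 0; i < sizes.length; i++) {
            for (int j = i + 1; j < sizes.length; j++) {
                long largest = Math.max(sizes[i], sizes[j]);
                if (growsEnough(sizes[i] + sizes[j], largest)) {
                    pairs.add(new long[] {sizes[i], sizes[j]});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints "mergeable pairs: 0" for the sizes above.
        System.out.println("mergeable pairs: " + mergeablePairs(SIZES_KB).size());
    }
}
```

   Under those assumptions no pair qualifies, which matches the quoted claim: each segment is more than twice the size of the one below it, so adding a smaller segment never grows the larger one by 50%.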
   
   > For instance, `TieredMergePolicy` automatically takes the min of 
`maxMergeAtOnce` and `numSegsPerTier` as a merge factor, but it's not clear to 
me why this is important. If the merge policy allowed merges to have between 2 
and 10 segments in the above example, it could find merges in the described 
segment structure, and this would likely help have a lower write amplification 
for the same segment count?
   
   I think it did this in order to aim for an index geometry over time that has 
logarithmically sized levels (ish), where segment sizes tend to cluster into 
these close-ish levels?  It mimics the behavior of the more carefully defined 
`LogDoc/ByteSizeMergePolicy`.  But this doesn't seem intrinsic/important -- +1 
to allow it to pick a range of segments to merge at once?
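
   As a rough illustration of what picking a range of segments could buy in the example above, the sketch below looks for the smallest merge width that passes the 50% check. It uses the same assumptions as the example sizes suggest (1MB = 1000kB, merged size is the plain sum, "at least 50% bigger than the largest input" is allowed). The names are hypothetical and this is not how `TieredMergePolicy` selects or scores merges. It only checks contiguous windows in size-sorted order, which is sufficient here because, for a given largest segment, the biggest possible sum comes from the segments just below it.

```java
import java.util.Arrays;

public class MergeWidthCheck {
    // Segment sizes from the quoted example, in kB (assumption: 1MB = 1000kB).
    static final long[] SIZES_KB = {
        100, 300, 800, 2_000, 5_000, 12_000, 30_000, 70_000, 150_000, 400_000
    };

    // Assumed reading of the constraint: merged size >= 1.5 * largest input.
    static boolean growsEnough(long mergedSize, long largestInput) {
        return 2 * mergedSize >= 3 * largestInput;
    }

    // Returns the smallest width in [minWidth, maxWidth] for which some window of
    // size-adjacent segments passes the check, or -1 if no width does.
    static int smallestQualifyingWidth(long[] sizes, int minWidth, int maxWidth) {
        long[] sorted = sizes.clone();
        Arrays.sort(sorted);
        int limit = Math.min(maxWidth, sorted.length);
        for (int width = minWidth; width <= limit; width++) {
            for (int end = width - 1; end < sorted.length; end++) {
                long sum = 0;
                for (int i = end - width + 1; i <= end; i++) {
                    sum += sorted[i];
                }
                // sorted[end] is the largest segment in this window.
                if (growsEnough(sum, sorted[end])) {
                    return width;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Pairs only: prints -1, since no pair passes the check.
        System.out.println("pairs only: " + smallestQualifyingWidth(SIZES_KB, 2, 2));
        // Widths 2 to 10: prints 3, e.g. 300kB + 800kB + 2MB = 3.1MB vs. a 3MB threshold.
        System.out.println("2 to 10:    " + smallestQualifyingWidth(SIZES_KB, 2, 10));
    }
}
```

   So a policy restricted to pairs finds nothing to do, while one allowed to merge 3 or more segments at once does find candidates. Whether those merges actually lower write amplification for the same segment count is not something this sketch measures.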
   
   "Optimal" merging is hard!  I wish we had the perfect merge policy that 
simply took as input what amortized write amplification is acceptable during 
indexing, and given that budget would aim to maximize search performance (by 
some approximate measure)... for apps using NRT segment replication, where one 
JVM indexes and N JVMs search (physically/virtually different instances) the 
efficiently incrementally replicated index, they would tolerate possibly very 
high write amplification (Amazon Product Search falls into this case).  But for 
other apps that are indexing and searching on the same node, there's likely 
much less tolerance in burning so much CPU/IO to eke out a slightly more 
efficient index for searching ...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

