[ https://issues.apache.org/jira/browse/LUCENE-10556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536227#comment-17536227 ]

Adrien Grand commented on LUCENE-10556:
---------------------------------------

I dug a bit further into the benchmark to understand these numbers and noticed 
that the benchmark has a flaw: because it uses TieredMergePolicy's defaults, it 
wants segments to be larger than 2MB. But because docs are small and we flush 
every 100 docs, TieredMergePolicy keeps merging all segments together whenever 
the index reaches 10 segments. So there is one big segment (e.g. 100k docs) 
that keeps getting merged with 9 tiny 100-doc segments.
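For reference, here is a minimal sketch of the kind of setup that triggers this 
pathology. Class and field names are made up for illustration and not taken 
from the actual benchmark:

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SmallFlushesBench {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setMaxBufferedDocs(100);                 // flush a tiny segment every 100 docs
    iwc.setMergePolicy(new TieredMergePolicy()); // defaults want segments larger than 2MB
    try (IndexWriter w = new IndexWriter(dir, iwc)) {
      for (int i = 0; i < 200_000; i++) {
        Document doc = new Document();
        doc.add(new StoredField("body", "small doc " + i)); // docs are small
        w.addDocument(doc);
      }
    }
  }
}
{code}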

So essentially, most merges take the one big segment and increase its number of 
dirty docs by 900 and its number of dirty chunks by 9 without making it 
significantly bigger, until the segment is considered too dirty and gets 
rewritten.
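
For context, the "too dirty" trigger can be paraphrased as below. This is a 
sketch of the heuristic as described in the issue text quoted further down, 
not the actual Lucene implementation, whose exact thresholds and operators 
may differ:

{code:java}
// Sketch of the recompression heuristic as described in the issue text;
// not the actual Lucene implementation.
static boolean tooDirty(long numChunks, long numDirtyChunks,
                        long numDirtyDocs, int maxDocsPerChunk) {
  return numDirtyChunks * 100 > numChunks  // more than 1% of blocks are incomplete
      || numDirtyDocs > maxDocsPerChunk;   // dirty docs exceed the per-block doc limit
}
{code}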

As a comparison point, switching to LogDocMergePolicy with a min segment size 
of 1,000 docs makes the benchmark run in 15 seconds instead of 78 seconds.
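
For anyone who wants to reproduce the comparison, the switch looks roughly 
like this (a sketch; the 1,000-doc floor is the only non-default setting, and 
iwc is the IndexWriterConfig from the sketch above):

{code:java}
import org.apache.lucene.index.LogDocMergePolicy;

LogDocMergePolicy mp = new LogDocMergePolicy();
mp.setMinMergeDocs(1000); // treat segments under 1,000 docs as equally small
iwc.setMergePolicy(mp);
{code}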

> Relax the maximum dirtiness for stored fields and term vectors?
> ---------------------------------------------------------------
>
>                 Key: LUCENE-10556
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10556
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Stored fields and term vectors compress data and have merge-time 
> optimizations to copy compressed data directly instead of decompressing and 
> recompressing over and over again. However, sometimes incomplete blocks get 
> carried over (typically the last block of a flushed segment) and so these 
> file formats keep track of how "dirty" their current blocks are to know 
> whether stored fields / term vectors for a segment should be re-compressed.
> Currently the logic is to recompress if more than 1% of the blocks are 
> incomplete, or if the total number of missing documents across incomplete 
> blocks is more than the configured maximum number of documents per block.
> I'd be interested in evaluating what the compression ratio would be if we 
> relaxed these conditions a bit, e.g. by allowing up to 5% dirtiness. My gut 
> feeling is that the compression ratio could be barely worse while index-time 
> CPU usage could be significantly improved. 
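
For illustration, the proposed relaxation would amount to loosening the first 
check in the sketch above, e.g.:

{code:java}
// Hypothetical relaxation: allow up to 5% dirty blocks instead of 1%.
return numDirtyChunks * 20 > numChunks   // more than 5% of blocks are incomplete
    || numDirtyDocs > maxDocsPerChunk;
{code}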


