Adrien Grand created LUCENE-10556:
-------------------------------------

             Summary: Relax the maximum dirtiness for stored fields and term vectors?
                 Key: LUCENE-10556
                 URL: https://issues.apache.org/jira/browse/LUCENE-10556
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Adrien Grand


Stored fields and term vectors compress data and have merge-time optimizations 
to copy compressed data directly instead of decompressing and recompressing 
over and over again. However, sometimes incomplete blocks get carried over 
(typically the last block of a flushed segment) and so these file formats keep 
track of how "dirty" their current blocks are to know whether stored fields / 
term vectors for a segment should be re-compressed.

Currently the logic is to recompress if more than 1% of the blocks are 
incomplete, or if the total number of missing documents across incomplete 
blocks is more than the configured maximum number of documents per block.
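
The rule described above can be sketched roughly as follows. This is only an illustration of the description in this issue, not Lucene's actual code: the class, the field names and the maxDirtyPercent parameter are hypothetical and exist only to show how a 1% versus 5% threshold would change the decision.

```java
// Illustrative sketch only: names are hypothetical, not Lucene's API.
final class DirtinessCheck {

    /** Hypothetical per-segment counters tracked by the file format. */
    static final class SegmentStats {
        final long numBlocks;            // total number of compressed blocks
        final long numIncompleteBlocks;  // blocks flushed before they were full
        final long numMissingDocs;       // docs missing across incomplete blocks

        SegmentStats(long numBlocks, long numIncompleteBlocks, long numMissingDocs) {
            this.numBlocks = numBlocks;
            this.numIncompleteBlocks = numIncompleteBlocks;
            this.numMissingDocs = numMissingDocs;
        }
    }

    /**
     * Returns true if the segment should be recompressed at merge time
     * instead of having its compressed blocks copied as-is.
     * maxDirtyPercent is an assumed knob: 1 models the rule as described
     * in this issue, 5 models the proposed relaxation.
     */
    static boolean shouldRecompress(SegmentStats s, int maxDirtyPercent, int maxDocsPerBlock) {
        // Integer arithmetic avoids floating-point rounding on the ratio.
        boolean tooManyIncompleteBlocks =
            s.numIncompleteBlocks * 100 > (long) maxDirtyPercent * s.numBlocks;
        boolean tooManyMissingDocs = s.numMissingDocs > maxDocsPerBlock;
        return tooManyIncompleteBlocks || tooManyMissingDocs;
    }
}
```

For example, with 1000 blocks of which 20 are incomplete, 50 missing documents, and an assumed limit of 128 documents per block, shouldRecompress returns true at a 1% threshold and false at a 5% threshold, so the relaxed rule would let the merge copy the compressed data directly.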

I'd be interested in evaluating what the compression ratio would be if we 
relaxed these conditions a bit, e.g. by allowing up to 5% dirtiness. My gut 
feeling is that the compression ratio would be only marginally worse while 
index-time CPU usage could be significantly reduced.



