Adrien Grand created LUCENE-10556:
-------------------------------------
Summary: Relax the maximum dirtiness for stored fields and term vectors?
Key: LUCENE-10556
URL: https://issues.apache.org/jira/browse/LUCENE-10556
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand

Stored fields and term vectors compress data and have merge-time optimizations
that copy compressed data directly instead of decompressing and recompressing
it over and over again. However, incomplete blocks sometimes get carried over
(typically the last block of a flushed segment), so these file formats keep
track of how "dirty" their current blocks are in order to know whether the
stored fields / term vectors of a segment should be recompressed.

Currently the logic is to recompress if more than 1% of the blocks are
incomplete, or if the total number of missing documents across incomplete
blocks is more than the configured maximum number of documents per block.

I'd be interested in evaluating what the compression ratio would be if we
relaxed these conditions a bit, e.g. by allowing up to 5% dirtiness. My gut
feeling is that the compression ratio would be only marginally worse while
index-time CPU usage could be significantly reduced.
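
For illustration, here is a minimal sketch of the check described above, with
the dirtiness threshold turned into a parameter so the current 1% rule and the
proposed 5% rule can be compared side by side. This is not the code in Lucene:
the class name, method name and parameters are all hypothetical, and the
missing-documents condition is assumed to stay unchanged since the proposal
only mentions the percentage.

```java
/** Hypothetical sketch of the recompression check; not Lucene's actual code. */
final class DirtinessSketch {

  private DirtinessSketch() {}

  /**
   * Returns true if a segment's blocks should be recompressed on merge
   * rather than copied as-is.
   *
   * @param numBlocks total number of blocks in the segment
   * @param numDirtyBlocks number of incomplete blocks
   * @param numMissingDocs total number of documents missing across incomplete blocks
   * @param maxDocsPerBlock configured maximum number of documents per block
   * @param maxDirtyPercent tolerated percentage of incomplete blocks
   *     (assumption: 1 models the current rule, 5 the proposed one)
   */
  static boolean tooDirty(
      long numBlocks,
      long numDirtyBlocks,
      long numMissingDocs,
      int maxDocsPerBlock,
      int maxDirtyPercent) {
    // Condition 1: more than maxDirtyPercent % of the blocks are incomplete.
    // Cross-multiplied to stay in integer arithmetic.
    boolean tooManyDirtyBlocks = numDirtyBlocks * 100 > numBlocks * maxDirtyPercent;
    // Condition 2: the missing documents add up to more than one full block.
    boolean tooManyMissingDocs = numMissingDocs > maxDocsPerBlock;
    return tooManyDirtyBlocks || tooManyMissingDocs;
  }
}
```

With made-up numbers, a segment with 1000 blocks, 30 of them incomplete, 100
missing documents and 128 documents per block is recompressed under the 1%
threshold (`tooDirty(1000, 30, 100, 128, 1)` is true) but copied as-is under
the 5% threshold (`tooDirty(1000, 30, 100, 128, 5)` is false).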