[ https://issues.apache.org/jira/browse/LUCENE-10556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536219#comment-17536219 ]
Robert Muir commented on LUCENE-10556:
--------------------------------------

maybe we could try playing instead with the second threshold, e.g. require enough dirty docs to fill 2 blocks, 4 blocks, 8 blocks, etc. Just as an experiment. For this benchmark, you might be able to reduce the count of slow merges this way without increasing the space too much.

> Relax the maximum dirtiness for stored fields and term vectors?
> ---------------------------------------------------------------
>
>                 Key: LUCENE-10556
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10556
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Stored fields and term vectors compress data and have merge-time
> optimizations to copy compressed data directly instead of decompressing and
> recompressing over and over again. However, sometimes incomplete blocks get
> carried over (typically the last block of a flushed segment) and so these
> file formats keep track of how "dirty" their current blocks are to know
> whether stored fields / term vectors for a segment should be re-compressed.
> Currently the logic is to recompress if more than 1% of the blocks are
> incomplete, or if the total number of missing documents across incomplete
> blocks is more than the configured maximum number of documents per block.
> I'd be interested in evaluating what the compression ratio would be if we
> relaxed these conditions a bit, e.g. by allowing up to 5% dirtiness. My gut
> feeling is that the compression ratio could be barely worse while index-time
> CPU usage could be significantly improved.
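
For illustration, the two thresholds discussed above could be sketched roughly as follows. This is only a sketch of the check as the issue description words it, not Lucene's actual code: the class name DirtinessSketch, the method tooDirty and all of its parameters are hypothetical, and the real implementation may combine the two conditions differently. maxDirtyPercent models the proposed relaxation from 1% to 5%, and dirtyBlockMultiple models the suggestion to require enough dirty docs to fill 2, 4 or 8 blocks.

```java
final class DirtinessSketch {

    /**
     * Hypothetical helper (not a Lucene API): decides whether a segment should be
     * recompressed on merge instead of having its compressed blocks copied as-is.
     *
     * @param numChunks          total number of blocks in the segment
     * @param numDirtyChunks     number of incomplete ("dirty") blocks
     * @param missingDocs        total number of documents missing across incomplete blocks
     * @param maxDocsPerChunk    configured maximum number of documents per block
     * @param maxDirtyPercent    first threshold: allowed percentage of incomplete blocks
     * @param dirtyBlockMultiple second threshold: how many full blocks' worth of
     *                           missing documents are tolerated
     */
    static boolean tooDirty(long numChunks, long numDirtyChunks, long missingDocs,
                            int maxDocsPerChunk, int maxDirtyPercent, int dirtyBlockMultiple) {
        // First threshold: more than maxDirtyPercent percent of the blocks are incomplete.
        boolean tooManyDirtyChunks = numDirtyChunks * 100 > numChunks * maxDirtyPercent;
        // Second threshold: the missing documents would fill more than
        // dirtyBlockMultiple complete blocks.
        boolean tooManyMissingDocs = missingDocs > (long) maxDocsPerChunk * dirtyBlockMultiple;
        // The issue description joins the two conditions with "or".
        return tooManyDirtyChunks || tooManyMissingDocs;
    }

    public static void main(String[] args) {
        // 2% of blocks incomplete: compare the 1% limit with a relaxed 5% limit.
        System.out.println(tooDirty(1000, 20, 100, 1024, 1, 1));
        System.out.println(tooDirty(1000, 20, 100, 1024, 5, 1));
        // 3000 missing docs: compare a limit of 1 block with a limit of 4 blocks.
        System.out.println(tooDirty(1000, 5, 3000, 1024, 1, 1));
        System.out.println(tooDirty(1000, 5, 3000, 1024, 1, 4));
    }
}
```

Keeping both thresholds as parameters would make it easy to run the experiment with different values and compare merge times against index size.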