[ https://issues.apache.org/jira/browse/LUCENE-10556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536201#comment-17536201 ]
Adrien Grand commented on LUCENE-10556:
---------------------------------------

I played with tuning the threshold on luceneutil's StoredFieldsBenchmark with 1M geonames docs, and results are quite interesting:

||Dirtiness||Indexing time (msec)||
| 0% | 78534 |
| 1% | 78691 |
| 5% | 78644 |
| 20% | 77994 |
| 33% | 74871 |
| 50% | 71064 |
| 75% | 47466 |
| 100% | 13052 |

Indexing speed doesn't significantly improve until dirtiness is raised to 33%.

> Relax the maximum dirtiness for stored fields and term vectors?
> ---------------------------------------------------------------
>
>                 Key: LUCENE-10556
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10556
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Stored fields and term vectors compress data and have merge-time
> optimizations to copy compressed data directly instead of decompressing and
> recompressing over and over again. However, sometimes incomplete blocks get
> carried over (typically the last block of a flushed segment) and so these
> file formats keep track of how "dirty" their current blocks are to know
> whether stored fields / term vectors for a segment should be re-compressed.
> Currently the logic is to recompress if more than 1% of the blocks are
> incomplete, or if the total number of missing documents across incomplete
> blocks is more than the configured maximum number of documents per block.
> I'd be interested in evaluating what the compression ratio would be if we
> relaxed these conditions a bit, e.g. by allowing up to 5% dirtiness. My gut
> feeling is that the compression ratio could be barely worse while index-time
> CPU usage could be significantly improved.
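
For readers who want the recompression rule from the issue description in concrete form, here is a minimal sketch. It follows the wording of the description only and is not taken from the Lucene source; the class name, method name, and parameter names (DirtinessSketch, tooDirty, maxDirtyPercent, and so on) are all hypothetical. The tolerated percentage is a parameter, so the current 1% and the proposed 5% can be compared side by side.

```java
/** Hypothetical sketch of the rule described in the issue; not Lucene's actual API. */
final class DirtinessSketch {

  /**
   * Returns true if a segment's blocks should be recompressed at merge time
   * rather than copied as compressed data.
   *
   * @param numChunks       total number of blocks in the segment
   * @param numDirtyChunks  number of incomplete blocks
   * @param numDirtyDocs    documents missing across all incomplete blocks
   * @param maxDocsPerChunk configured maximum number of documents per block
   * @param maxDirtyPercent tolerated percentage of incomplete blocks (assumed knob)
   */
  static boolean tooDirty(
      long numChunks,
      long numDirtyChunks,
      long numDirtyDocs,
      int maxDocsPerChunk,
      int maxDirtyPercent) {
    // Integer form of: numDirtyChunks / numChunks > maxDirtyPercent / 100
    boolean tooManyDirtyChunks = numDirtyChunks * 100 > numChunks * maxDirtyPercent;
    // Missing documents add up to more than one full block.
    boolean tooManyMissingDocs = numDirtyDocs > maxDocsPerChunk;
    // The description joins the two conditions with "or".
    return tooManyDirtyChunks || tooManyMissingDocs;
  }

  public static void main(String[] args) {
    // Made-up segment: 1,000 blocks, 30 incomplete (3%), 100 missing docs,
    // at most 128 docs per block.
    System.out.println(tooDirty(1000, 30, 100, 128, 1)); // 1% limit
    System.out.println(tooDirty(1000, 30, 100, 128, 5)); // 5% limit
  }
}
```

With these made-up numbers the segment is recompressed under a 1% limit and bulk-copied under a 5% limit, which is the kind of difference the benchmark above is probing.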