[
https://issues.apache.org/jira/browse/LUCENE-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197328#comment-17197328
]
Robert Muir commented on LUCENE-9529:
-------------------------------------
The current code tracks the total number of chunks, and the total number of
"dirty" (incomplete) chunks.
Then we compute "tooDirty" like this:
{code}
/**
 * Returns true if we should recompress this reader, even though we could bulk merge compressed data.
 * <p>
 * The last chunk written for a segment is typically incomplete, so without recompressing,
 * in some worst-case situations (e.g. frequent reopen with tiny flushes), over time the
 * compression ratio can degrade. This is a safety switch.
 */
boolean tooDirty(CompressingStoredFieldsReader candidate) {
  // more than 1% dirty, or more than hard limit of 1024 dirty chunks
  return candidate.getNumDirtyChunks() > 1024
      || candidate.getNumDirtyChunks() * 100 > candidate.getNumChunks();
}
{code}
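To make the failure mode concrete, here is a tiny standalone sketch (plain Java, illustrative numbers only, not Lucene API) showing how a single dirty chunk passes the check with small blocks but trips it with large ones:
{code}
public class DirtyRatioDemo {
  // Same chunk-based check as in tooDirty() above.
  static boolean tooDirty(long numDirtyChunks, long numChunks) {
    return numDirtyChunks > 1024 || numDirtyChunks * 100 > numChunks;
  }

  public static void main(String[] args) {
    // A hypothetical 10MB segment that ends in one incomplete chunk:
    System.out.println(tooDirty(1, 640)); // 16kB blocks: 0.16% dirty -> false
    System.out.println(tooDirty(1, 20));  // ~512kB blocks: 5% dirty -> true
  }
}
{code}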
To be fairer, maybe we could use a similar formula but track numDirtyDocs and
compare it against numDocs (a value we already know)? We could still keep a
safety switch such as the hard limit of 1024 dirty chunks to guard against
worst-case scenarios, but at least change the ratio, as sketched below.
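A rough sketch of what that could look like (getNumDirtyDocs() and getNumDocs() are assumed accessors for this sketch, not confirmed API):
{code}
boolean tooDirty(CompressingStoredFieldsReader candidate) {
  // Keep the hard safety switch on dirty chunks, but compute the 1% ratio
  // over documents instead of chunks. getNumDirtyDocs()/getNumDocs() are
  // assumed accessors here, named for illustration only.
  return candidate.getNumDirtyChunks() > 1024
      || candidate.getNumDirtyDocs() * 100 > candidate.getNumDocs();
}
{code}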
> Larger stored fields block sizes mean we're more likely to disable optimized
> bulk merging
> -----------------------------------------------------------------------------------------
>
> Key: LUCENE-9529
> URL: https://issues.apache.org/jira/browse/LUCENE-9529
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
>
> Whenever possible when merging stored fields, Lucene tries to copy the
> compressed data instead of decompressing it from the source segment and
> re-compressing it in the destination segment. A problem with this approach is
> that if a block is incomplete (typically the last block of a segment), it
> remains incomplete in the destination segment too, and if we do this for too
> long we end up with a bad compression ratio. So Lucene keeps track of these
> incomplete blocks, and makes sure to keep the ratio of incomplete blocks
> below 1%.
> But as we increased the block size, a high ratio of incomplete blocks has
> become more likely. E.g. for a segment with 1MB of stored fields, with the
> previous 16kB blocks you have 63 complete blocks and 1 incomplete block,
> i.e. 1.6%. But now with ~512kB blocks, you have one complete block and one
> incomplete block, i.e. 50%.
> I'm not sure how to fix it, or even whether it should be fixed, but wanted
> to open an issue to track this.