[
https://issues.apache.org/jira/browse/LUCENE-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302454#comment-17302454
]
Robert Muir commented on LUCENE-9827:
-------------------------------------
[~dm] Thanks for testing and reporting back!
We may be able to improve the performance further; this was just a first stab
that makes the minimum possible change (it is best to tread with caution on this
issue IMO).
Personally I think the {{candidate.getNumDirtyChunks() > 1}} check was a necessary
first step to fix our staircase. Increasing this {{1}} to higher values
probably isn't a good idea: small merges now get optimized around "half"
the time, e.g. we'll merge 5 segments each with {{1}} dirty chunk into a new
segment with {{5}} dirty chunks, and then recompress that the next time around.
But maybe we can consider the 1% threshold? Perhaps it is too aggressive and we
should adjust it for a better overall tradeoff between compression ratio and speed.
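To make the "half the time" behavior concrete, here is a minimal, self-contained sketch. It is not the actual Lucene code: {{SegmentStats}}, {{tooDirty}} and {{bulkMerge}} are hypothetical names, and the exact form of the check (more than 1 dirty chunk AND strictly more than 1% of chunks dirty) is an assumption pieced together from the comment above and the issue description.

```java
import java.util.Collections;
import java.util.List;

class DirtySegmentSketch {

  /** Hypothetical stand-in for per-segment stored fields statistics. */
  static final class SegmentStats {
    final long numChunks;
    final long numDirtyChunks;

    SegmentStats(long numChunks, long numDirtyChunks) {
      this.numChunks = numChunks;
      this.numDirtyChunks = numDirtyChunks;
    }
  }

  /**
   * Assumed heuristic: a segment needs recompression only if it has more than
   * one dirty chunk AND more than 1% of its chunks are dirty.
   */
  static boolean tooDirty(SegmentStats s) {
    return s.numDirtyChunks > 1 && s.numDirtyChunks * 100 > s.numChunks;
  }

  /** Models a bulk-copy merge: chunks are copied as-is, so dirty chunks add up. */
  static SegmentStats bulkMerge(List<SegmentStats> segments) {
    long chunks = 0;
    long dirty = 0;
    for (SegmentStats s : segments) {
      chunks += s.numChunks;
      dirty += s.numDirtyChunks;
    }
    return new SegmentStats(chunks, dirty);
  }

  public static void main(String[] args) {
    // A freshly flushed small segment: 10 chunks, only the last one incomplete.
    SegmentStats flushed = new SegmentStats(10, 1);
    System.out.println(tooDirty(flushed)); // false: a single dirty chunk is tolerated

    // Bulk-merging 5 such segments yields 50 chunks, 5 of them dirty.
    SegmentStats merged = bulkMerge(Collections.nCopies(5, flushed));
    System.out.println(tooDirty(merged)); // true: recompressed the next time around

    // The same 5 dirty chunks in a large segment stay under the 1% threshold.
    SegmentStats large = new SegmentStats(1000, 5);
    System.out.println(tooDirty(large)); // false
  }
}
```

Under these assumptions, raising the {{1}} would only let more dirty chunks pile up before the recompression kicks in, while loosening the 1% threshold changes how large a segment must be before its dirty chunks stop mattering.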
> Small segments are slower to merge due to stored fields since 8.7
> -----------------------------------------------------------------
>
> Key: LUCENE-9827
> URL: https://issues.apache.org/jira/browse/LUCENE-9827
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: log-and-lucene-9827.patch, merge-count-by-num-docs.png,
> merge-type-by-version.png,
> total-merge-time-by-num-docs-on-small-segments.png,
> total-merge-time-by-num-docs.png
>
>
> [~dm] and [~dimitrisli] looked into an interesting case where indexing slowed
> down after upgrading to 8.7. After digging we identified that this was due to
> the merging of stored fields, which had become slower on average.
> This is due to changes to stored fields, which now have top-level blocks that
> are then split into sub-blocks and compressed using shared dictionaries (one
> dictionary per top-level block). As the top-level blocks are larger than they
> were before, segments are more likely to be considered "dirty" by the merging
> logic. Dirty segments are segments where 1% of the data or more consists of
> incomplete blocks. For large segments, the size of blocks doesn't really
> affect the dirtiness of segments: if you flush a segment that has 100 blocks
> or more, it will never be considered dirty as only the last block may be
> incomplete. But for small segments it does: for instance if your segment is
> only 10 blocks, it is very likely considered dirty given that the last block
> is always incomplete. And the fact that we increased the top-level block size
> means that segments that used to be considered clean might now be considered
> dirty.
> And indeed benchmarks reported that while large stored fields merges became
> slightly faster after upgrading to 8.7, the smaller merges actually became
> slower. See attached chart, which gives the total merge time as a function of
> the number of documents in the segment.
> I don't know how we can address this; it is a natural consequence of the
> larger block size, which is needed to achieve better compression ratios. But
> I wanted to open an issue about it in case someone has a bright idea how we
> could make things better.