[jira] [Commented] (LUCENE-9827) Small segments are slower to merge due to stored fields since 8.7

Adrien Grand (Jira) Thu, 11 Mar 2021 02:29:07 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299477#comment-17299477
 ]


Adrien Grand commented on LUCENE-9827:
--------------------------------------

In the worst-case scenario where someone does 1-document flushes with small 
documents, I believe that this would replace half of the semi-optimized merges 
with optimized merges, so this would certainly help indexing speed.

{{getNumDirtyDocs}} is bounded by the max number of docs per chunk. So assuming 
1kB documents (before compression), each chunk has 480 documents with 
BEST_COMPRESSION (480kB block size). So segments that have 1 dirty chunk cannot 
have more than 479 dirty documents and {{candidate.getNumDirtyDocs() * 100 > 
candidate.getNumDocs()}} always evaluates false if the input segment has more 
than 479*100=47,900 documents. So this change may only affect small segments 
and should have a negligible impact on the space efficiency of large indices.
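
The arithmetic above can be checked with a small standalone sketch. It is not Lucene code: the class and method names are hypothetical, and the 480 documents per chunk figure is the assumption stated above (1kB documents, BEST_COMPRESSION). Only the comparison itself is taken from the condition quoted in this comment.

```java
public final class DirtyDocsBound {
    // Assumption from the comment above: 1kB documents and a 480kB block,
    // so a full chunk holds 480 documents.
    static final int DOCS_PER_CHUNK = 480;

    /** The quoted condition: true when more than 1% of the documents are dirty. */
    static boolean exceedsDirtyThreshold(long numDirtyDocs, long numDocs) {
        return numDirtyDocs * 100 > numDocs;
    }

    /** An incomplete chunk holds at most one document fewer than a full chunk. */
    static long maxDirtyDocsForOneChunk(int docsPerChunk) {
        return docsPerChunk - 1L;
    }

    public static void main(String[] args) {
        long maxDirty = maxDirtyDocsForOneChunk(DOCS_PER_CHUNK);
        long bound = maxDirty * 100;
        System.out.println("max dirty docs with one dirty chunk: " + maxDirty);
        System.out.println("condition is always false from numDocs = " + bound);
        // A 1-document flush is always above the threshold.
        System.out.println("1-doc segment exceeds threshold: " + exceedsDirtyThreshold(1, 1));
    }
}
```

With these assumptions, a segment with a single dirty chunk can only satisfy the condition while it has fewer than 47,900 documents.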

+1 to merge this change


> Small segments are slower to merge due to stored fields since 8.7
> -----------------------------------------------------------------
>
>                 Key: LUCENE-9827
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9827
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: total-merge-time-by-num-docs-on-small-segments.png
>
>
> [~dm] and [~dimitrisli] looked into an interesting case where indexing slowed 
> down after upgrading to 8.7. After digging we identified that this was due to 
> the merging of stored fields, which had become slower on average.
> This is due to changes to stored fields, which now have top-level blocks that 
> are then split into sub-blocks and compressed using shared dictionaries (one 
> dictionary per top-level block). As the top-level blocks are larger than they 
> were before, segments are more likely to be considered "dirty" by the merging 
> logic. Dirty segments are segments where 1% of the data or more consists of 
> incomplete blocks. For large segments, the size of blocks doesn't really 
> affect the dirtiness of segments: if you flush a segment that has 100 blocks 
> or more, it will never be considered dirty as only the last block may be 
> incomplete. But for small segments it does: for instance if your segment is 
> only 10 blocks, it is very likely considered dirty given that the last block 
> is always incomplete. And the fact that we increased the top-level block size 
> means that segments that used to be considered clean might now be considered 
> dirty.
> And indeed benchmarks reported that while large stored fields merges became 
> slightly faster after upgrading to 8.7, the smaller merges actually became 
> slower. See attached chart, which gives the total merge time as a function of 
> the number of documents in the segment.
> I don't know how we can address this: it is a natural consequence of the 
> larger block size, which is needed to achieve better compression ratios. But 
> I wanted to open an issue about it in case someone has a bright idea how we 
> could make things better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

