Adrien Grand created LUCENE-9827:
------------------------------------

             Summary: Small segments are slower to merge due to stored fields since 8.7
                 Key: LUCENE-9827
                 URL: https://issues.apache.org/jira/browse/LUCENE-9827
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Adrien Grand
         Attachments: total-merge-time-by-num-docs-on-small-segments.png

[~dm] and [~dimitrisli] looked into an interesting case where indexing slowed 
down after upgrading to 8.7. After digging we identified that this was due to 
the merging of stored fields, which had become slower on average.

This is due to changes to stored fields, which now have top-level blocks that 
are then split into sub-blocks and compressed using shared dictionaries (one 
dictionary per top-level block). As the top-level blocks are larger than they 
were before, segments are more likely to be considered "dirty" by the merging 
logic. Dirty segments are segments where 1% of the data or more consists of 
incomplete blocks. For large segments, the size of blocks doesn't really affect 
the dirtiness of segments: if you flush a segment that has 100 blocks or more, 
it will never be considered dirty as only the last block may be incomplete. But 
for small segments it does: for instance if your segment is only 10 blocks, it 
is very likely considered dirty given that the last block is always incomplete. 
And the fact that we increased the top-level block size means that segments 
that used to be considered clean might now be considered dirty.
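
To make the arithmetic above concrete, here is a minimal sketch of the dirtiness check. It is a simplified model and not Lucene's actual merging code: the class and method names are hypothetical, it counts blocks rather than bytes, it uses a strict "more than 1%" comparison so that a 100-block segment with one incomplete block stays clean (as described above), and the two block sizes are made-up values chosen only for illustration.

```java
public class DirtySegmentSketch {

    // Threshold from the description above: roughly 1% of the data.
    static final long DIRTY_PERCENT = 1;

    /**
     * Simplified model (assumption): a segment is dirty when incomplete
     * blocks make up more than 1% of all of its blocks.
     */
    static boolean isDirty(long numBlocks, long numIncompleteBlocks) {
        return numIncompleteBlocks * 100 > numBlocks * DIRTY_PERCENT;
    }

    /** Number of blocks needed to hold segmentBytes, rounding up. */
    static long blocksFor(long segmentBytes, long blockBytes) {
        return (segmentBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long segmentBytes = 8L << 20;      // hypothetical 8 MiB of stored fields
        long smallBlock = 16L << 10;       // hypothetical "old" block size
        long largeBlock = 480L << 10;      // hypothetical "new" top-level block size

        for (long blockBytes : new long[] {smallBlock, largeBlock}) {
            long numBlocks = blocksFor(segmentBytes, blockBytes);
            // A freshly flushed segment has at most one incomplete block: the last one.
            System.out.println(numBlocks + " blocks -> dirty=" + isDirty(numBlocks, 1));
        }
    }
}
```

Under these assumed sizes, the same segment goes from 512 blocks (clean) to 18 blocks (dirty) purely because the block size grew, which is the effect described above.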

And indeed benchmarks reported that while large stored fields merges became 
slightly faster after upgrading to 8.7, the smaller merges actually became 
slower. See attached chart, which gives the average merge time as a function of 
the number of documents in the segment.

I don't know how we can address this, as it is a natural consequence of the 
larger block size, which is needed to achieve better compression ratios. But I 
wanted to open an issue about it in case someone has a bright idea for how we 
could make things better.
