[ https://issues.apache.org/jira/browse/LUCENE-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298245#comment-17298245 ]
Robert Muir commented on LUCENE-9827:
-------------------------------------

{quote}
I don't know how we can address this, this is a natural consequence of the larger block size, which is needed to achieve better compression ratios. But I wanted to open an issue about it in case someone has a bright idea how we could make things better.
{quote}

Well, that's the real issue. I'd argue the current heuristic was always bad here, even before the block size increase. It just happens to be worse now because the block size increased, the compression ratio is better, and so on. But let's forget about exact sizes and try to improve the logic itself. I'll inline the code to make it easier to look at:

{code}
  /**
   * Returns true if we should recompress this reader, even though we could bulk merge compressed
   * data
   *
   * <p>The last chunk written for a segment is typically incomplete, so without recompressing, in
   * some worst-case situations (e.g. frequent reopen with tiny flushes), over time the compression
   * ratio can degrade. This is a safety switch.
   */
  boolean tooDirty(Lucene90CompressingStoredFieldsReader candidate) {
    // more than 1% dirty, or more than hard limit of 1024 dirty chunks
    return candidate.getNumDirtyChunks() > 1024
        || candidate.getNumDirtyDocs() * 100 > candidate.getNumDocs();
  }
{code}

Please ignore the safety switch (1024 dirty chunks) for now, as it isn't relevant to small merges. The logic is really just "more than 1% dirty docs".

I think we should avoid recompressing the data over and over for small merges. In other words, I don't want to recompress everything for Merge1, Merge2, Merge3, Merge4, and Merge5, and then finally start bulk copying at Merge6. We were probably already doing this to some extent before.

I'd like the formula to have an "expected value" baked into it. In other words, we only recompress everything on purpose if it's going to result in us getting less dirty than we were before, or something like that (e.g. we think we'll fold some dirty chunks into complete chunks and make "progress"). This would have a cost in compression ratio for small segments, but it shouldn't be bad in the big scheme of things.
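To make the "expected value" idea concrete, here is a rough sketch of what such a heuristic could look like. This is not a patch: it assumes the writer's {{maxDocsPerChunk}} field is in scope (as in the stored fields writer), keeps the existing reader accessors from the snippet above, and the exact condition is just one way to express "only recompress if we expect to fold dirty chunks into complete ones".

{code}
  /**
   * Sketch of a "make progress" variant of tooDirty(): only recompress when the candidate's
   * dirty docs are numerous enough to be folded into at least one complete chunk, so the
   * rewritten data is expected to end up less dirty than the input. Otherwise keep bulk
   * copying and accept a slightly worse ratio for small segments.
   */
  boolean tooDirty(Lucene90CompressingStoredFieldsReader candidate) {
    // Hard safety limit, unchanged from the current heuristic.
    if (candidate.getNumDirtyChunks() > 1024) {
      return true;
    }
    // Expected progress: recompressing can only help if this candidate's dirty docs would
    // fill at least one complete chunk in the merged segment. maxDocsPerChunk is assumed to
    // be the writer's per-chunk document limit.
    boolean canMakeProgress = candidate.getNumDirtyDocs() >= maxDocsPerChunk;
    // Keep the existing "more than 1% dirty docs" condition, but only act on it when
    // recompression is expected to make progress.
    boolean dirtyEnough = candidate.getNumDirtyDocs() * 100 > candidate.getNumDocs();
    return canMakeProgress && dirtyEnough;
  }
{code}

With something along these lines, the early small merges (Merge1 through Merge5 in the example above) would keep bulk copying, and recompression would only kick in once a merge has accumulated enough dirty docs to actually reclaim a chunk.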
> Small segments are slower to merge due to stored fields since 8.7
> -----------------------------------------------------------------
>
>                 Key: LUCENE-9827
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9827
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: total-merge-time-by-num-docs-on-small-segments.png
>
>
> [~dm] and [~dimitrisli] looked into an interesting case where indexing slowed down after upgrading to 8.7. After digging we identified that this was due to the merging of stored fields, which had become slower on average.
> This is due to changes to stored fields, which now have top-level blocks that are then split into sub-blocks and compressed using shared dictionaries (one dictionary per top-level block). As the top-level blocks are larger than they were before, segments are more likely to be considered "dirty" by the merging logic. Dirty segments are segments where 1% or more of the data consists of incomplete blocks. For large segments, the size of blocks doesn't really affect the dirtiness of segments: if you flush a segment that has 100 blocks or more, it will never be considered dirty as only the last block may be incomplete.
> But for small segments it does: for instance, if your segment is only 10 blocks, it is very likely considered dirty given that the last block is always incomplete. And the fact that we increased the top-level block size means that segments that used to be considered clean might now be considered dirty.
> And indeed benchmarks reported that while large stored fields merges became slightly faster after upgrading to 8.7, the smaller merges actually became slower. See the attached chart, which gives the total merge time as a function of the number of documents in the segment.
> I don't know how we can address this, this is a natural consequence of the larger block size, which is needed to achieve better compression ratios. But I wanted to open an issue about it in case someone has a bright idea how we could make things better.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org