[ https://issues.apache.org/jira/browse/LUCENE-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343895#comment-17343895 ]
ASF subversion and git services commented on LUCENE-9827:
---------------------------------------------------------

Commit 23c34c757111b7ad1d9e8110756ef224eb63ef98 in lucene-solr's branch refs/heads/branch_8x from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=23c34c7 ]

LUCENE-9827: avoid wasteful recompression for small segments (#2495)

Require that the segment has enough dirty documents to create a clean chunk before recompressing during merge: there must be at least maxChunkSize of them. This prevents wasteful recompression with small flushes (e.g. one per document); we ensure recompression achieves some "permanent" progress.

Expose maxDocsPerChunk as a parameter for term vectors too, matching the stored fields format. This allows for easy testing.

Increment numDirtyDocs for partially optimized merges: if segment N needs recompression, we have to flush any buffered docs before bulk-copying segment N+1. Don't just increment numDirtyChunks; make sure numDirtyDocs is incremented too. This has no performance impact and is unrelated to the tooDirty() improvements, but it is easier to reason about things with correct statistics in the index.

Further tuning of how dirtiness is measured: for simplicity, just use the percentage of dirty chunks.
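The two merge-time checks described in the commit message can be sketched as follows. This is an illustrative reconstruction, not Lucene's actual code: the method names (tooDirty, worthRecompressing), the plain-long parameters, and the 1% threshold are assumptions standing in for the real Lucene90CompressingStoredFieldsWriter internals.

```java
// Hypothetical sketch of the heuristics in LUCENE-9827; names and the
// 1% threshold are illustrative, not Lucene's exact implementation.
class DirtinessHeuristics {

    /**
     * Simplified dirtiness measure from the commit: a candidate segment is
     * "too dirty" to bulk-copy (and should be recompressed) when more than
     * some percentage of its chunks are dirty.
     */
    static boolean tooDirty(long numDirtyChunks, long numChunks) {
        // assumed threshold: more than 1% dirty chunks triggers recompression
        return numDirtyChunks * 100 > numChunks;
    }

    /**
     * Recompression only makes "permanent" progress if the dirty documents
     * can fill at least one clean chunk; otherwise bulk-copy and leave the
     * chunks dirty for a later, larger merge.
     */
    static boolean worthRecompressing(long numDirtyDocs, int maxDocsPerChunk) {
        return numDirtyDocs >= maxDocsPerChunk;
    }
}
```

Under this sketch, a segment flushed one document at a time accumulates dirty chunks but is only recompressed once enough dirty documents exist to produce a full clean chunk, which is the "permanent progress" guarantee the commit describes.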
Co-authored-by: Adrien Grand <jpou...@gmail.com>

> Small segments are slower to merge due to stored fields since 8.7
> -----------------------------------------------------------------
>
>                 Key: LUCENE-9827
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9827
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>             Fix For: main (9.0)
>
>         Attachments: Indexer.java, log-and-lucene-9827.patch, merge-count-by-num-docs.png, merge-type-by-version.png, total-merge-time-by-num-docs-on-small-segments.png, total-merge-time-by-num-docs.png
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> [~dm] and [~dimitrisli] looked into an interesting case where indexing slowed down after upgrading to 8.7. After digging, we identified that this was due to the merging of stored fields, which had become slower on average.
>
> This is due to changes to stored fields, which now have top-level blocks that are then split into sub-blocks and compressed using shared dictionaries (one dictionary per top-level block). As the top-level blocks are larger than they were before, segments are more likely to be considered "dirty" by the merging logic. Dirty segments are segments where 1% of the data or more consists of incomplete blocks. For large segments, the size of blocks doesn't really affect the dirtiness of segments: if you flush a segment that has 100 blocks or more, it will never be considered dirty, as only the last block may be incomplete. But for small segments it does: for instance, if your segment is only 10 blocks, it is very likely considered dirty, given that the last block is always incomplete. And the fact that we increased the top-level block size means that segments that used to be considered clean might now be considered dirty.
>
> And indeed, benchmarks reported that while large stored fields merges became slightly faster after upgrading to 8.7, the smaller merges actually became slower.
> See the attached chart, which gives the total merge time as a function of the number of documents in the segment.
>
> I don't know how we can address this; it is a natural consequence of the larger block size, which is needed to achieve better compression ratios. But I wanted to open an issue about it in case someone has a bright idea for how we could make things better.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
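The dirtiness arithmetic in the quoted description can be made concrete with a small sketch. Under the stated simplification that only a segment's last block is incomplete, the dirty fraction is roughly 1/numBlocks, so a 100-block segment sits at the 1% boundary while a 10-block segment is ten times over it. The class and method below are purely illustrative, not part of Lucene.

```java
// Illustrative arithmetic for why small segments are dirtier: assuming only
// the last (incomplete) block of a flushed segment is dirty, the dirty
// fraction is about 1 / numBlocks, measured against a 1% threshold.
class DirtyFraction {

    static final double DIRTY_THRESHOLD = 0.01; // the 1% cutoff from the description

    static double dirtyFraction(int numBlocks) {
        // simplification: exactly one incomplete block per flushed segment
        return 1.0 / numBlocks;
    }

    static boolean isDirty(int numBlocks) {
        return dirtyFraction(numBlocks) > DIRTY_THRESHOLD;
    }
}
```

For example, a 10-block segment has a dirty fraction of 0.1 and is considered dirty, while a 200-block segment sits at 0.005 and stays clean, which is why increasing the top-level block size (fewer blocks per small segment) pushed previously clean segments over the threshold.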