[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028727#comment-17028727 ]
Michael Froh commented on LUCENE-8962:
--------------------------------------

bq. Yeah I think you are right! That would be a nice simplification. Probably this can just be folded into the existing MergePolicy API as a different MergeTrigger. Though then I wonder why e.g. forceMerge or expungeDeletes are not also simply different triggers ... Michael Froh what do you think?

As I was first writing this, I added a {{MergeTrigger.COMMIT}} value and used that, rather than adding a dedicated method. Then I realized that any time I've ever written a custom implementation of {{MergePolicy.findMerges()}}, I've ignored the {{MergeTrigger}} value, because I didn't really care what triggered the merge -- I just wanted to define the {{MergeSpecification}}. Even {{TieredMergePolicy.findMerges()}} doesn't look at the {{MergeTrigger}} parameter.

If I had made {{IndexWriter}} call {{findMerges}} with a {{MergeTrigger.COMMIT}} trigger, anyone with a similar {{MergePolicy}} would have probably ended up running (and blocking on) some pretty expensive merges on commit. The best way I could think of to be backwards compatible with the "old" behavior by default was to add a no-op method to the base class.

Looking through the history, it looks like {{forceMerge}} and {{expungeDeletes}} predate {{MergeTrigger}}, so that could explain them.

I really like the idea of controlling this with a {{MergeTrigger}}, but I'm concerned about breaking existing {{MergePolicy}} implementations that ignore the {{MergeTrigger}} (which I suspect may be most of them).

> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Major
>         Attachments: LUCENE-8962_demo.png
>
>          Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory
> segments to disk and open an {{IndexReader}} to search them, and this is
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}}
> will write many small segments during {{refresh}} and this then
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if
> given a little time ... so, could we somehow improve {{IndexWriter}}'s
> refresh to optionally kick off merge policy to merge segments below some
> threshold before opening the near-real-time reader? It'd be a bit tricky
> because while we are waiting for merges, indexing may continue, and new
> segments may be flushed, but those new segments shouldn't be included in the
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy,
> and some hackity logic to have the merge policy target small segments just
> written by refresh, but it's tricky to then open a near-real-time reader,
> excluding newly flushed but including newly merged segments since the refresh
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for
> discussion!
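
To make the backward-compatibility concern from the comment concrete, here is a minimal, self-contained sketch of the two options. It does not use Lucene: {{ToyMergePolicy}}, {{Trigger}}, {{LegacyPolicy}} and {{findCommitMerges}} are hypothetical stand-ins invented for illustration, and segments are modeled as plain strings. It only assumes what the comment describes, namely an existing policy that ignores the trigger it is given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CommitMergeSketch {

    // Hypothetical stand-in for a merge trigger enum; COMMIT is the new value under discussion.
    enum Trigger { SEGMENT_FLUSH, COMMIT }

    // Toy base class (an assumption for illustration, not Lucene's MergePolicy).
    abstract static class ToyMergePolicy {
        // Returns the groups of segments that should be merged together.
        abstract List<List<String>> findMerges(Trigger trigger, List<String> segments);

        // Option B: a dedicated hook whose default is a no-op, so existing
        // subclasses keep their old behavior unless they opt in.
        List<List<String>> findCommitMerges(List<String> segments) {
            return new ArrayList<>();
        }
    }

    // An existing policy written before COMMIT existed; like the policies
    // described in the comment, it never looks at the trigger.
    static class LegacyPolicy extends ToyMergePolicy {
        @Override
        List<List<String>> findMerges(Trigger trigger, List<String> segments) {
            List<List<String>> merges = new ArrayList<>();
            if (segments.size() > 1) {
                merges.add(new ArrayList<>(segments)); // merge everything, whatever the trigger
            }
            return merges;
        }
    }

    // Option A: the writer reuses findMerges with the new trigger value.
    static int mergesOnCommitViaTrigger(ToyMergePolicy policy, List<String> segments) {
        return policy.findMerges(Trigger.COMMIT, segments).size();
    }

    // Option B: the writer calls the dedicated hook instead.
    static int mergesOnCommitViaHook(ToyMergePolicy policy, List<String> segments) {
        return policy.findCommitMerges(segments).size();
    }

    public static void main(String[] args) {
        ToyMergePolicy legacy = new LegacyPolicy();
        List<String> segments = Arrays.asList("_0", "_1", "_2");
        System.out.println("via trigger: " + mergesOnCommitViaTrigger(legacy, segments)); // 1
        System.out.println("via hook:    " + mergesOnCommitViaHook(legacy, segments));    // 0
    }
}
```

In this toy model the trigger-ignoring policy schedules a merge on commit under option A without its author ever asking for one, while under option B nothing happens until a subclass overrides the hook. The trade-off is an extra method on the base class in exchange for unchanged defaults.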