[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054483#comment-17054483 ]
Michael McCandless commented on LUCENE-8962: -------------------------------------------- {quote}I have translated some Elasticsearch tests (see [^failed-tests.patch]) to Lucene. {quote} Thanks [~dnhatn] – it's no good if ES tests are failing but Lucene's tests are not! Are these ES test runs / logs publicly visible somewhere? {quote}Wouldn’t it be easier to sort it all out in a branch rather on master? {quote} +1, let's revert on master too, until we can work through these new randomized testing failures. This is certainly tricky code, and we should iterate outside master/8.x until it's working well with all tests. {quote}I'm thinking of adding a {{boolean}} field to {{OneMerge}} that gets set once a merge is successfully committed (e.g. just before the call to {{closeMergeReaders}} in {{IndexWriter.commitMerge()}}), which the {{mergeFinished}} override can use to determine if the merge completed successfully or not. {quote} +1 – clearly {{mergeFinished}} needs some way to know if the merge was successful or not. Maybe open a separate issue/PR to solve this first, as a precursor? {quote}With a slightly refactored IW we can share the merge logic and let the reader re-write itself since we are talking about very small segments the overhead is very small. {quote} I don't think we should do this – {{IndexWriter}}'s purpose is making changes to the index, and {{IndexReader}} simply reads what {{IndexWriter}} created. There are wildly diverse users of Lucene and if we now set down the path of expecting/allowing {{IndexReader}} to do it's own "little" optimizations on startup, that can add a lot of unexpected cost, and surprising bugs, to many use cases. {{IndexWriter}} is indeed complex, but we should find ways to reduce that complexity so that we can implement features in the right classes, rather than shifting index-changing features out to {{IndexReader}}. In our (Amazon customer facing product search) usage of Lucene, this optimization is impactful because a single index is replicated across deep replica count, and shifting work that could be done once by {{IndexWriter}} to every {{IndexReader}} that later opens that commit point would add quite a bit of cost to those readers. That said, I'm definitely +1 to find a simpler way to achieve this powerful feature, just within {{IndexWriter}}. If you look at the discussion in the original PR, we considered making it just another trigger under the existing {{MergePolicy}} methods, but there were downsides (discussed on the PR) to that. Also, this feature/change is only targeting {{commit}} which is already a simplification (NOT trying to target NRT readers). But maybe [~simonwillnauer] you see a simpler way to implement this during {{IndexWriter.commit}}? > Can we merge small segments during refresh, for faster searching? > ----------------------------------------------------------------- > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index > Reporter: Michael McCandless > Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org