[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054483#comment-17054483
 ] 

Michael McCandless commented on LUCENE-8962:
--------------------------------------------

{quote}I have translated some Elasticsearch tests (see [^failed-tests.patch]) 
to Lucene.
{quote}
Thanks [~dnhatn] – it's no good if ES tests are failing but Lucene's tests are 
not!  Are these ES test runs / logs publicly visible somewhere?
{quote}Wouldn’t it be easier to sort it all out in a branch rather on master?
{quote}
+1, let's revert on master too, until we can work through these new randomized 
testing failures.  This is certainly tricky code, and we should iterate outside 
master/8.x until it's working well with all tests.
{quote}I'm thinking of adding a {{boolean}} field to {{OneMerge}} that gets set 
once a merge is successfully committed (e.g. just before the call to 
{{closeMergeReaders}} in {{IndexWriter.commitMerge()}}), which the 
{{mergeFinished}} override can use to determine if the merge completed 
successfully or not.
{quote}
+1 – clearly {{mergeFinished}} needs some way to know if the merge was 
successful or not.  Maybe open a separate issue/PR to solve this first, as a 
precursor?
{quote}With a slightly refactored IW we can share the merge logic and let the 
reader re-write itself since we are talking about very small segments the 
overhead is very small. 
{quote}
I don't think we should do this – {{IndexWriter}}'s purpose is making changes 
to the index, and {{IndexReader}} simply reads what {{IndexWriter}} created.  
There are wildly diverse users of Lucene and if we now set down the path of 
expecting/allowing {{IndexReader}} to do it's own "little" optimizations on 
startup, that can add a lot of unexpected cost, and surprising bugs, to many 
use cases.  {{IndexWriter}} is indeed complex, but we should find ways to 
reduce that complexity so that we can implement features in the right classes, 
rather than shifting index-changing features out to {{IndexReader}}.

In our (Amazon customer facing product search) usage of Lucene, this 
optimization is impactful because a single index is replicated across deep 
replica count, and shifting work that could be done once by {{IndexWriter}} to 
every {{IndexReader}} that later opens that commit point would add quite a bit 
of cost to those readers.

That said, I'm definitely +1 to find a simpler way to achieve this powerful 
feature, just within {{IndexWriter}}.  If you look at the discussion in the 
original PR, we considered making it just another trigger under the existing 
{{MergePolicy}} methods, but there were downsides (discussed on the PR) to 
that.  Also, this feature/change is only targeting {{commit}} which is already 
a simplification (NOT trying to target NRT readers).  But maybe 
[~simonwillnauer] you see a simpler way to implement this during 
{{IndexWriter.commit}}?

> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Major
>             Fix For: 8.5
>
>         Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>          Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to