[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

Simon Willnauer (Jira) Sun, 08 Mar 2020 12:35:11 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054506#comment-17054506
 ]


Simon Willnauer commented on LUCENE-8962:
-----------------------------------------

{quote}
I don't think we should do this – IndexWriter's purpose is making changes to 
the index, and IndexReader simply reads what IndexWriter created.  There are 
wildly diverse users of Lucene and if we now set down the path of 
expecting/allowing IndexReader to do its own "little" optimizations on 
startup, that can add a lot of unexpected cost, and surprising bugs, to many 
use cases.  IndexWriter is indeed complex, but we should find ways to reduce 
that complexity so that we can implement features in the right classes, rather 
than shifting index-changing features out to IndexReader.
{quote}

This was really just an idea for how this could be done without adding much 
complexity to existing methods. I can see this being a method like 
_IndexReader optimize(IndexReader reader, MergePolicy policy)_ that can even 
commit the changes. It can run in parallel with normal indexing, but it is 
idempotent and much simpler to test. Again, this is just an idea for another 
approach. 
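
To make the "idempotent and simple to test" point concrete, here is a minimal 
toy sketch. It does not use any Lucene API: a "reader" is modelled as a plain 
list of per-segment doc counts, and the merge policy is reduced to a single 
hypothetical threshold, minDocs. The class and method names are assumptions 
made up for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model only: a "reader" is a list of segment doc counts, not a Lucene IndexReader. */
final class OptimizeSketch {
    private OptimizeSketch() {}

    /**
     * Hypothetical stand-in for optimize(IndexReader, MergePolicy): returns a
     * snapshot in which all segments smaller than minDocs are merged into one.
     * The input list is never modified.
     */
    static List<Integer> optimize(List<Integer> segments, int minDocs) {
        List<Integer> kept = new ArrayList<>();
        int mergedDocs = 0;
        int smallCount = 0;
        for (int docs : segments) {
            if (docs < minDocs) {
                mergedDocs += docs;
                smallCount++;
            } else {
                kept.add(docs);
            }
        }
        if (smallCount < 2) {
            // Fewer than two small segments: nothing to merge, so a second
            // call on an already optimized snapshot changes nothing.
            return segments;
        }
        kept.add(mergedDocs);
        return Collections.unmodifiableList(kept);
    }
}
```

Because the function only maps one snapshot to another, calling it twice gives 
the same result as calling it once, and a test needs no running writer.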

> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Major
>             Fix For: 8.5
>
>         Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>          Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will write many small segments during {{refresh}}, and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!
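
The point-in-time requirement described in the issue can be illustrated with 
a small toy model. This is not Lucene code and not the eventual 
implementation: segments are just doc counts, concurrent indexing is 
simulated by a callback, and all names (RefreshSketch, refreshWithMerge, 
minDocs) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: simulates a writer whose refresh merges small segments. */
final class RefreshSketch {
    private final List<Integer> live = new ArrayList<>();

    /** Simulates flushing a new segment with the given doc count. */
    void flush(int docs) {
        live.add(docs);
    }

    List<Integer> liveSegments() {
        return new ArrayList<>(live);
    }

    /**
     * Returns a point-in-time view in which segments below minDocs are merged.
     * Segments flushed by concurrentIndexing while the merge "runs" stay in
     * the writer but are excluded from the returned view.
     */
    List<Integer> refreshWithMerge(int minDocs, Runnable concurrentIndexing) {
        List<Integer> snapshot = new ArrayList<>(live); // point in time
        concurrentIndexing.run();                       // indexing continues

        List<Integer> view = new ArrayList<>();
        int mergedDocs = 0;
        int smallCount = 0;
        for (int docs : snapshot) {
            if (docs < minDocs) {
                mergedDocs += docs;
                smallCount++;
            } else {
                view.add(docs);
            }
        }
        if (smallCount < 2) {
            return snapshot; // nothing worth merging
        }
        view.add(mergedDocs);

        // Segments appended after the snapshot were flushed during the merge.
        List<Integer> flushedLater =
            new ArrayList<>(live.subList(snapshot.size(), live.size()));
        live.clear();
        live.addAll(view);
        live.addAll(flushedLater);
        return view;
    }
}
```

In this model the returned view contains the merged segment but none of the 
segments flushed during the merge, while the writer itself keeps both.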



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

