[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142388#comment-17142388
 ] 

Simon Willnauer commented on LUCENE-8962:
-----------------------------------------

[~msoko...@gmail.com] I think I have bad news for you. I think we forgot about 
an important guarantee of updateDocument that no longer holds when this 
feature is enabled. During a merge we carry over deletes that are applied 
concurrently to the source segments. This means that if we include the merged 
segment, with those carried-over deletes, in the commit we are currently 
making, we don't carry over the corresponding added document, so there are 
documents not visible in the commit that should be visible. I think 
this is a pretty significant problem and we should roll back this feature for 
now. I am sorry that it took me a while to understand the implications here, but 
these IW logs are not the most intuitive thing to read and stare at. 

We need to look into commitMergedDeletes to see if we can make it work, or if 
we need to do the merge again, since we now need point-in-time semantics for 
merges too. I am not sure if that's actually fixable, but let's look into it. 
[~mikemccand] WDYT? I think we should roll it back for now. 
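
To make the scenario above concrete, here is a minimal toy model in plain Java. 
It does not use Lucene at all: Segment, merge, visible and commitView are 
hypothetical names invented for this sketch, and the 
carryConcurrentDeletes=false branch only illustrates what point-in-time 
semantics for the merge would mean. It is not a proposed fix for 
commitMergedDeletes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these types are Lucene classes.
final class UpdateVisibilitySketch {

    /** A segment holds {id, version} documents plus the ids deleted in it. */
    static final class Segment {
        final String name;
        final List<String[]> docs = new ArrayList<>();
        final Set<String> deletedIds = new HashSet<>();

        Segment(String name) { this.name = name; }

        void add(String id, String version) { docs.add(new String[] {id, version}); }
    }

    /** Returns the live "id=version" entries visible through the given segments. */
    static List<String> visible(List<Segment> segments) {
        List<String> live = new ArrayList<>();
        for (Segment s : segments) {
            for (String[] d : s.docs) {
                if (!s.deletedIds.contains(d[0])) {
                    live.add(d[0] + "=" + d[1]);
                }
            }
        }
        return live;
    }

    /** Merges the sources, carrying over whatever deletes they hold when the merge finishes. */
    static Segment merge(String name, List<Segment> sources) {
        Segment merged = new Segment(name);
        for (Segment s : sources) {
            merged.docs.addAll(s.docs);
            merged.deletedIds.addAll(s.deletedIds);
        }
        return merged;
    }

    /** Returns what a commit sees when it swaps in a merge that raced with an update. */
    static List<String> commitView(boolean carryConcurrentDeletes) {
        Segment s1 = new Segment("s1");
        s1.add("1", "v1");
        Segment s2 = new Segment("s2");
        s2.add("2", "v1");
        List<Segment> sources = Arrays.asList(s1, s2);

        // The commit starts here: remember the deletes that exist at this point in time.
        Set<String> deletesAtCommitStart = new HashSet<>();
        for (Segment s : sources) {
            deletesAtCommitStart.addAll(s.deletedIds);
        }

        // While the merge runs, an update of id 1 arrives: the old copy is deleted
        // in s1 and the new copy lands in s3, which is NOT part of this commit.
        s1.deletedIds.add("1");
        Segment s3 = new Segment("s3");
        s3.add("1", "v2");

        Segment merged = merge("m1", sources);
        if (!carryConcurrentDeletes) {
            // Hypothetical point-in-time behaviour: drop deletes made after the commit began.
            merged.deletedIds.retainAll(deletesAtCommitStart);
        }

        // The commit contains the merged segment in place of s1 and s2, but not s3.
        return visible(Arrays.asList(merged));
    }

    public static void main(String[] args) {
        System.out.println("carried-over deletes: " + commitView(true));  // [2=v1]
        System.out.println("point-in-time deletes: " + commitView(false)); // [1=v1, 2=v1]
    }
}
```

In the first case document 1 is missing from the commit in every version: the 
delete half of the update was carried into the merged segment, while the add 
half lives in a segment the commit does not include.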

> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Major
>             Fix For: 8.6
>
>         Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt
>
>          Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!


