Adrien Grand created LUCENE-10029:
-------------------------------------

             Summary: Can we make refreshes cheaper via two-phase refresh?
                 Key: LUCENE-10029
                 URL: https://issues.apache.org/jira/browse/LUCENE-10029
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Adrien Grand


Currently our recommendation is to use something like {{SearcherManager}} to 
periodically refresh the current DirectoryReader, asynchronously from searches.

Under the hood, refreshes call {{DirectoryReader#openIfChanged}}, which flushes 
all current {{DocumentsWriterPerThread}} (DWPT) instances so that pending 
changes become visible. For instance, a user who would like the view of the 
index to be at most 10s old could refresh every 10 seconds.

But refreshes incur an indexing penalty because they may cause arbitrarily 
small segments to be written, which in turn means that more merging will need 
to happen later to turn these small segments into larger ones. For data 
structures that may need lots of computation for merging, such as n-dimensional 
points, which need to recompute the entire BKD tree, or stored fields, which 
might need to re-compress blocks of documents, this cost may be non-negligible.

I wonder if we could make this a bit better by making refreshes a two-phase 
operation. The first phase would record the list of all current DWPTs, and the 
second phase would flush those that haven't been flushed already.
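
A minimal sketch of the two phases described above, in plain Java with no 
Lucene dependency. {{ToyBuffer}} and {{TwoPhaseRefresher}} are hypothetical 
names invented for illustration; they are not Lucene classes, and a real 
implementation would have to deal with concurrency and with the actual DWPT 
lifecycle.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a DocumentsWriterPerThread; NOT a Lucene class. */
final class ToyBuffer {
    final String name;
    private boolean flushed;

    ToyBuffer(String name) {
        this.name = name;
    }

    boolean isFlushed() {
        return flushed;
    }

    /** Marks the buffer as written to a segment; returns false if it already was. */
    boolean flush() {
        if (flushed) {
            return false;
        }
        flushed = true;
        return true;
    }
}

/** Hypothetical two-phase refresh over a set of toy buffers (single-threaded). */
final class TwoPhaseRefresher {
    private final List<ToyBuffer> live = new ArrayList<>();
    private List<ToyBuffer> snapshot = new ArrayList<>();

    ToyBuffer newBuffer(String name) {
        ToyBuffer buffer = new ToyBuffer(name);
        live.add(buffer);
        return buffer;
    }

    /** Phase 1: remember which unflushed buffers exist right now; flush nothing. */
    void prepare() {
        live.removeIf(ToyBuffer::isFlushed);
        snapshot = new ArrayList<>(live);
    }

    /** Phase 2: flush only the remembered buffers that are still unflushed. */
    List<String> complete() {
        List<String> flushedNow = new ArrayList<>();
        for (ToyBuffer buffer : snapshot) {
            if (buffer.flush()) {
                flushedNow.add(buffer.name);
            }
        }
        live.removeIf(ToyBuffer::isFlushed);
        snapshot = new ArrayList<>();
        return flushedNow;
    }
}
```

In this toy model, a buffer created after {{prepare()}} is left alone by the 
next {{complete()}}, and a buffer that was flushed in between (for example 
because it filled up) is skipped, so the second phase writes no extra small 
segment for it.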

For instance, taking again the example of a user who wants the current 
point-in-time view of the data to be at most 10s old, SearcherManager could be 
configured so that every 5 seconds it would flush all DWPTs that already 
existed 5 seconds earlier. This would give the same guarantee that the current 
point-in-time view of the data is at most 10s old, while also ensuring that we 
never flush a DWPT that was created less than 5 seconds ago.
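
The arithmetic behind this example can be checked with a small sketch. It 
assumes ticks happen at exact multiples of the interval, that a DWPT created 
exactly on a tick is included in that tick's snapshot, and that nothing else 
flushes the DWPT in the meantime. {{RefreshTimingSketch}} and 
{{flushTimeMillis}} are hypothetical names, not Lucene APIs.

```java
/** Timing model for the "flush what existed one interval ago" schedule. */
final class RefreshTimingSketch {

    /**
     * Returns the time at which a buffer created at createdAtMillis would be
     * flushed, assuming ticks at 0, interval, 2 * interval, ...
     */
    static long flushTimeMillis(long createdAtMillis, long intervalMillis) {
        // Phase 1 happens at the first tick at or after creation.
        long snapshotTick =
            ((createdAtMillis + intervalMillis - 1) / intervalMillis) * intervalMillis;
        // Phase 2 happens one interval later.
        return snapshotTick + intervalMillis;
    }

    public static void main(String[] args) {
        long interval = 5_000;
        long minAge = Long.MAX_VALUE;
        long maxAge = Long.MIN_VALUE;
        // Try every creation time in the first minute, at millisecond resolution.
        for (long createdAt = 0; createdAt <= 60_000; createdAt++) {
            long age = flushTimeMillis(createdAt, interval) - createdAt;
            minAge = Math.min(minAge, age);
            maxAge = Math.max(maxAge, age);
        }
        System.out.println("min age at flush (ms): " + minAge);
        System.out.println("max age at flush (ms): " + maxAge);
    }
}
```

Under these assumptions a DWPT is between one and two intervals old when it is 
flushed. Since any document it holds was added at or after the DWPT's creation, 
that document waits less than two intervals (10s here) before becoming visible.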

At this point this is only theoretical: I haven't done the work of checking 
whether this would actually help in practice. It would likely only help when 
indexing either fast enough or with a small enough indexing buffer that DWPTs 
would naturally get flushed because of memory usage between consecutive 
refreshes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

