[
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027006#comment-17027006
]
David Smiley commented on LUCENE-8962:
--------------------------------------
Woah; I defer to your expertise [~mikemccand]. IndexWriter has a huge bus
factor and I haven't delved into it. Still... I want to confirm what I think
you are telling me. Based on my understanding of when merges are triggered and
observed (by a reader/searcher), I wrote the following test on
{{TestIndexWriterMergePolicy}}:
{code:java}
public void testMergeOnCommitIsSearchable() throws IOException {
  try (
      Directory dir = newDirectory();
      IndexWriter writer = new IndexWriter(dir,
          newIndexWriterConfig(new MockAnalyzer(random()))
              .setMaxBufferedDocs(10)
              .setMergePolicy(new LogDocMergePolicy())
              .setMergeScheduler(new SerialMergeScheduler()))
  ) {
    for (int i = 0; i < 99; i++) {
      addDoc(writer);
      checkInvariants(writer);
    }
    assertEquals(9, writer.getSegmentCount());
    assertEquals(9, writer.getNumBufferedDocuments());
    writer.commit();
    try (DirectoryReader reader = DirectoryReader.open(writer)) {
      assertEquals(1, reader.getSequentialSubReaders().size());
    }
  }
}
{code}
Generally speaking, and as seen here, after a commit a search application will
open a new reader to search over the recently committed documents.
In the scenario above, I index a bunch of documents, some of which have
already been flushed and some of which are still pending. Also notice the
SerialMergeScheduler, which makes the writing thread merge in-process /
synchronously. I then open an NRT reader from the writer and count the
segments. I get 1, because the flushed buffer produces the 10th segment and
the configured LogDocMergePolicy merges them all together at 10.
_The test passes_. So; am I missing something?
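To make the segment arithmetic explicit, here is a minimal standalone sketch of the counting the test relies on. It is a toy model, not Lucene code: the class {{SegmentCountSketch}} and its methods are hypothetical, and the merge rule (collapse everything once the segment count reaches the merge factor) is a deliberate simplification of what LogDocMergePolicy does. I am also assuming a merge factor of 10, matching the "merge at 10" behavior described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of flush-by-doc-count plus a simplified merge rule. Not Lucene code. */
final class SegmentCountSketch {
    private final int maxBufferedDocs;   // flush once this many docs are buffered
    private final int mergeFactor;       // assumed: merge once this many segments exist
    private final List<Integer> segments = new ArrayList<>(); // doc count per segment
    private int buffered = 0;

    SegmentCountSketch(int maxBufferedDocs, int mergeFactor) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.mergeFactor = mergeFactor;
    }

    /** Buffers one document and flushes when the buffer is full. */
    void addDoc() {
        buffered++;
        if (buffered >= maxBufferedDocs) {
            flush();
        }
    }

    /** A commit flushes whatever is still buffered, which may trigger a merge. */
    void commit() {
        flush();
    }

    int segmentCount() {
        return segments.size();
    }

    int bufferedDocs() {
        return buffered;
    }

    /** Writes the buffer out as a new segment, then merges synchronously if needed. */
    private void flush() {
        if (buffered == 0) {
            return;
        }
        segments.add(buffered);
        buffered = 0;
        maybeMerge();
    }

    /** Simplified rule: once mergeFactor segments exist, merge them all into one. */
    private void maybeMerge() {
        if (segments.size() >= mergeFactor) {
            int total = 0;
            for (int docs : segments) {
                total += docs;
            }
            segments.clear();
            segments.add(total);
        }
    }

    public static void main(String[] args) {
        SegmentCountSketch writer = new SegmentCountSketch(10, 10);
        for (int i = 0; i < 99; i++) {
            writer.addDoc();
        }
        System.out.println("before commit: segments=" + writer.segmentCount()
            + ", buffered=" + writer.bufferedDocs());
        writer.commit();
        System.out.println("after commit: segments=" + writer.segmentCount());
    }
}
```

In this model, 99 documents leave 9 flushed segments and 9 buffered documents; the commit flushes the 10th segment, which immediately collapses everything into one. The merge runs inside {{flush()}} on the calling thread, which is the role SerialMergeScheduler plays in the real test.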
> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael McCandless
> Priority: Major
> Attachments: LUCENE-8962_demo.png
>
> Time Spent: 3h 10m
> Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory
> segments to disk and open an {{IndexReader}} to search them, and this is
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}}
> will write many small segments during {{refresh}} and this then
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if
> given a little time ... so, could we somehow improve {{IndexWriter}}'s
> refresh to optionally kick off merge policy to merge segments below some
> threshold before opening the near-real-time reader? It'd be a bit tricky
> because while we are waiting for merges, indexing may continue, and new
> segments may be flushed, but those new segments shouldn't be included in the
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy,
> and some hackity logic to have the merge policy target small segments just
> written by refresh, but it's tricky to then open a near-real-time reader,
> excluding newly flushed but including newly merged segments since the refresh
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for
> discussion!