[ https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530476#comment-17530476 ]
Vigya Sharma commented on LUCENE-10216:
---------------------------------------

I think the PR is ready for review, with existing tests passing and new tests added for the changes. {{OneMerge}} distribution is now provided by a new {{findMerges(CodecReader[])}} API in {{MergePolicy}}, and executed by {{MergeScheduler}} threads. I've also modified {{MockRandomMergePolicy}} to randomly pick a highly concurrent (one segment per reader) {{findMerges(...)}} implementation 50% of the time, and confirmed manually that tests pass in both scenarios, i.e. whether the new implementation or the default one is picked (thanks Michael McCandless for the suggestion). A rough sketch of the per-reader {{findMerges}} idea follows the quoted description below.

> Add concurrency to addIndexes(CodecReader…) API
> -----------------------------------------------
>
>                 Key: LUCENE-10216
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10216
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Vigya Sharma
>            Priority: Major
>          Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> I work at Amazon Product Search, and we use Lucene to power search for the
> e-commerce platform. I’m working on a project that involves applying
> metadata+ETL transforms and indexing documents on n different _indexing_
> boxes, combining them into a single index on a separate _reducer_ box, and
> making it available for queries on m different _search_ boxes (replicas).
> Segments are asynchronously copied from indexers to reducers to searchers
> as they become available for the next layer to consume.
>
> I am using the addIndexes API to combine multiple indexes into one on the
> reducer boxes. Since we also have taxonomy data, we need to remap facet
> field ordinals, which means I need to use the {{addIndexes(CodecReader…)}}
> version of this API. The API leverages {{SegmentMerger.merge()}} to create
> segments with new ordinal values while also merging all provided segments
> in the process.
>
> _This is, however, a blocking call that runs in a single thread._ Until we
> have written segments with new ordinal values, we cannot copy them to
> searcher boxes, which increases the time to make documents available for
> search.
>
> I was playing around with the API by creating multiple concurrent merges,
> each with only a single reader, creating a concurrently running 1:1
> conversion from old segments to new ones (with new ordinal values). We
> follow this up with non-blocking background merges. This lets us copy the
> segments to searchers and replicas as soon as they are available, and later
> replace them with merged segments as background jobs complete. On the
> Amazon dataset I profiled, this gave us around a 2.5 to 3x improvement in
> addIndexes() time. Each call was given about 5 readers to add on average.
>
> This might be a useful addition to Lucene. We could create another
> {{addIndexes()}} API with a {{boolean}} flag for concurrency, which
> internally submits multiple merge jobs (each with a single reader) to the
> {{ConcurrentMergeScheduler}}, and waits for them to complete before
> returning.
>
> While this is doable from outside Lucene by using your own thread pool,
> starting multiple addIndexes() calls and waiting for them to complete, I
> felt it needs some understanding of what addIndexes does, why you need to
> wait on the merge, and why it makes sense to pass a single reader to the
> addIndexes API.
>
> Out-of-the-box support in Lucene could simplify this for folks with a
> similar use case.
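To make the per-reader distribution concrete, here is a minimal sketch of a {{MergePolicy}} that emits one {{OneMerge}} per incoming reader, so the {{MergeScheduler}} can run the resulting addIndexes merges concurrently. The {{findMerges(CodecReader...)}} signature and the {{OneMerge}} constructor accepting readers are assumptions based on the description above, not necessarily the exact shapes in the PR:

{code:java}
// Hypothetical sketch only: the findMerges(CodecReader...) signature and the
// OneMerge constructor that accepts readers are assumed from the description
// above and may differ from the final PR.
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;

public class OneMergePerReaderPolicy extends FilterMergePolicy {

  public OneMergePerReaderPolicy(MergePolicy in) {
    super(in);
  }

  @Override
  public MergeSpecification findMerges(CodecReader... readers) throws IOException {
    // Wrap each incoming reader in its own OneMerge, so the MergeScheduler
    // can execute the addIndexes merges concurrently.
    MergeSpecification spec = new MergeSpecification();
    for (CodecReader reader : readers) {
      spec.add(new OneMerge(List.of(reader))); // assumed constructor shape
    }
    return spec;
  }
}
{code}

The sketch only illustrates the one-{{OneMerge}}-per-reader distribution; the default implementation described above would instead wrap all readers into a single merge.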
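For comparison, the "doable from outside Lucene" workaround mentioned in the quoted description can look roughly like the sketch below: one {{addIndexes(CodecReader...)}} call per reader, submitted to an application-owned thread pool, with the caller waiting for all of them. The helper class, method name, and pool sizing here are illustrative only, not part of Lucene or the PR:

{code:java}
// Illustrative sketch of the external workaround: the class name, method, and
// fixed-size pool are not from the PR or from Lucene itself.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.IndexWriter;

public final class ConcurrentAddIndexes {

  private ConcurrentAddIndexes() {}

  /** Runs one addIndexes(CodecReader...) call per reader and waits for all of them. */
  public static void addAll(IndexWriter writer, List<CodecReader> readers, int threads)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Long>> pending = new ArrayList<>();
      for (CodecReader reader : readers) {
        // Each task rewrites a single reader into one new segment.
        pending.add(pool.submit((Callable<Long>) () -> writer.addIndexes(reader)));
      }
      for (Future<Long> result : pending) {
        result.get(); // surfaces any failure as an ExecutionException
      }
    } finally {
      pool.shutdown();
    }
  }
}
{code}

Since {{IndexWriter}} is safe to use from multiple threads, the concurrent calls produce one new segment per reader, which can then be cleaned up by normal background merges as the description suggests.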