Vigya Sharma created LUCENE-10216:
-------------------------------------

             Summary: Add concurrency to addIndexes(CodecReader…) API
                 Key: LUCENE-10216
                 URL: https://issues.apache.org/jira/browse/LUCENE-10216
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/index
            Reporter: Vigya Sharma


I work at Amazon Product Search, and we use Lucene to power search for the 
e-commerce platform. I’m working on a project that involves applying 
metadata+ETL transforms and indexing documents on n different _indexing_ boxes, 
combining them into a single index on a separate _reducer_ box, and making it 
available for queries on m different _search_ boxes (replicas). Segments are 
asynchronously copied from indexers to reducers to searchers as they become 
available for the next layer to consume.

I am using the addIndexes API to combine multiple indexes into one on the 
reducer boxes. Since we also have taxonomy data, we need to remap facet field 
ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version of 
this API. The API leverages {{SegmentMerger.merge()}} to create segments with 
new ordinal values while also merging all provided segments in the process.

_This is, however, a blocking call that runs in a single thread._ Until we have 
written segments with new ordinal values, we cannot copy them to searcher 
boxes, which increases the time to make documents available for search.

I was playing around with the API by creating multiple concurrent merges, each 
with only a single reader, creating a concurrently running 1:1 conversion from 
old segments to new ones (with new ordinal values). We follow this up with 
non-blocking background merges. This lets us copy the segments to searchers and 
replicas as soon as they are available, and later replace them with merged 
segments as background jobs complete. On the Amazon dataset I profiled, this 
gave us a roughly 2.5x to 3x improvement in addIndexes() time. Each call was 
given about 5 readers to add on average.

This might be a useful addition to Lucene. We could create another {{addIndexes()}} 
API with a {{boolean}} flag for concurrency, that internally submits multiple 
merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, and 
waits for them to complete before returning.

While this is doable from outside Lucene by using your own thread pool, starting 
multiple addIndexes() calls, and waiting for them to complete, it requires 
some understanding of what addIndexes does, why you need to wait on the merge, 
and why it makes sense to pass a single reader to each addIndexes call.
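
For illustration, the outside-Lucene approach described above can be sketched with plain JDK concurrency. This is only a minimal sketch, not the code I profiled: {{ConcurrentAddSketch}}, {{ReaderAdder}} and {{addEachConcurrently}} are hypothetical names, and the adder callback is a stand-in for a per-reader call such as {{writer.addIndexes(reader)}}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;

final class ConcurrentAddSketch {

    /** Hypothetical stand-in for a per-reader call such as writer.addIndexes(reader). */
    interface ReaderAdder<R> {
        void add(R reader) throws Exception;
    }

    /**
     * Submits one job per reader to the caller's pool and blocks until all jobs finish.
     * If any job fails, join() throws a CompletionException wrapping the cause.
     */
    static <R> void addEachConcurrently(List<R> readers, ReaderAdder<R> adder, Executor pool) {
        List<CompletableFuture<Void>> jobs = new ArrayList<>();
        for (R reader : readers) {
            // One reader per job, so each job is a 1:1 conversion of an old segment.
            jobs.add(CompletableFuture.runAsync(() -> {
                try {
                    adder.add(reader);
                } catch (Exception e) {
                    // Surface checked exceptions (e.g. IOException) through the future.
                    throw new CompletionException(e);
                }
            }, pool));
        }
        // Wait for every job before returning, mirroring the blocking contract of addIndexes.
        CompletableFuture.allOf(jobs.toArray(new CompletableFuture[0])).join();
    }
}
```

With a real {{IndexWriter}}, the adder would presumably be something like {{reader -> writer.addIndexes(reader)}}, with the regular background merges combining the resulting segments afterwards.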

Out-of-the-box support in Lucene could simplify this for folks with a similar use case.


