[jira] [Commented] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API
[ https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530504#comment-17530504 ] Robert Muir commented on LUCENE-10216: -- {quote} I've also modified the MockRandomMergePolicy to randomly pick a highly concurrent, (one segment per reader), findMerges(...) implementation 50% of the time. {quote} Is this going to make 50% of our test runs non-reproducible? Please, let's consider the need to reproduce failing runs and not pick such highly concurrent stuff for the unit tests. We should be able to test it another way. > Add concurrency to addIndexes(CodecReader…) API > --- > > Key: LUCENE-10216 > URL: https://issues.apache.org/jira/browse/LUCENE-10216 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Vigya Sharma >Priority: Major > Time Spent: 6.5h > Remaining Estimate: 0h > > I work at Amazon Product Search, and we use Lucene to power search for the > e-commerce platform. I’m working on a project that involves applying > metadata+ETL transforms and indexing documents on n different _indexing_ > boxes, combining them into a single index on a separate _reducer_ box, and > making it available for queries on m different _search_ boxes (replicas). > Segments are asynchronously copied from indexers to reducers to searchers as > they become available for the next layer to consume. > I am using the addIndexes API to combine multiple indexes into one on the > reducer boxes. Since we also have taxonomy data, we need to remap facet field > ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version > of this API. The API leverages {{SegmentMerger.merge()}} to create segments > with new ordinal values while also merging all provided segments in the > process. > _This is however a blocking call that runs in a single thread._ Until we have > written segments with new ordinal values, we cannot copy them to searcher > boxes, which increases the time to make documents available for search. > I was playing around with the API by creating multiple concurrent merges, > each with only a single reader, creating a concurrently running 1:1 > conversion from old segments to new ones (with new ordinal values). We follow > this up with non-blocking background merges. This lets us copy the segments > to searchers and replicas as soon as they are available, and later replace > them with merged segments as background jobs complete. On the Amazon dataset > I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. > Each call was given about 5 readers to add on average. > This might be useful add to Lucene. We could create another {{addIndexes()}} > API with a {{boolean}} flag for concurrency, that internally submits multiple > merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, > and waits for them to complete before returning. > While this is doable from outside Lucene by using your thread pool, starting > multiple addIndexes() calls and waiting for them to complete, I felt it needs > some understanding of what addIndexes does, why you need to wait on the merge > and why it makes sense to pass a single reader in the addIndexes API. > Out of box support in Lucene could simplify this for folks a similar use case. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
rmuir commented on code in PR #633: URL: https://github.com/apache/lucene/pull/633#discussion_r862451702 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java: ## @@ -86,6 +86,20 @@ public MergeSpecification findMerges( return mergeSpec; } + @Override + public MergeSpecification findMerges(CodecReader... readers) throws IOException { +if (random.nextBoolean() == true) { + return super.findMerges(readers); +} else { + // create a oneMerge for each reader to let them get concurrently processed by addIndexes() Review Comment: Can we do this much much less often, possibly only at nightly, or not at all? It is very easy to transition from "repeatable randomized tests" to "chaotic flaky non-reproducing madness" by adding a bunch of concurrency everywhere. It's not the way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
uschindler commented on code in PR #633: URL: https://github.com/apache/lucene/pull/633#discussion_r862455421 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java: ## @@ -86,6 +86,20 @@ public MergeSpecification findMerges( return mergeSpec; } + @Override + public MergeSpecification findMerges(CodecReader... readers) throws IOException { +if (random.nextBoolean() == true) { + return super.findMerges(readers); +} else { + // create a oneMerge for each reader to let them get concurrently processed by addIndexes() Review Comment: Yes please don't do this for all tests. Not even nightly. Please add a specific test only for concurrently merging addindexes, bit not make it the default for all uses of IndexWriter in unit tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
uschindler commented on code in PR #633: URL: https://github.com/apache/lucene/pull/633#discussion_r862455421 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java: ## @@ -86,6 +86,20 @@ public MergeSpecification findMerges( return mergeSpec; } + @Override + public MergeSpecification findMerges(CodecReader... readers) throws IOException { +if (random.nextBoolean() == true) { + return super.findMerges(readers); +} else { + // create a oneMerge for each reader to let them get concurrently processed by addIndexes() Review Comment: Yes please don't do this for all tests. Not even nightly. Please add a specific test only for concurrently merging addIndexes, but not make it the default for all uses of IndexWriter in unit tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
msokolov commented on code in PR #633: URL: https://github.com/apache/lucene/pull/633#discussion_r862485790 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java: ## @@ -86,6 +86,20 @@ public MergeSpecification findMerges( return mergeSpec; } + @Override + public MergeSpecification findMerges(CodecReader... readers) throws IOException { +if (random.nextBoolean() == true) { + return super.findMerges(readers); +} else { + // create a oneMerge for each reader to let them get concurrently processed by addIndexes() Review Comment: Perhaps create specific tests for the ones that actually failed when this was enabled for all tests? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
uschindler commented on code in PR #633: URL: https://github.com/apache/lucene/pull/633#discussion_r862504146 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java: ## @@ -86,6 +86,20 @@ public MergeSpecification findMerges( return mergeSpec; } + @Override + public MergeSpecification findMerges(CodecReader... readers) throws IOException { +if (random.nextBoolean() == true) { + return super.findMerges(readers); +} else { + // create a oneMerge for each reader to let them get concurrently processed by addIndexes() Review Comment: I don't think those tests are wrong. The tests are there to test specific merging behaviour. If they only expect one segment they should only get one segment. Those tests should not suddenly get a concurrent addIndexes behaviour. That's not what they assert. So there is definitely nothing wrong with those tests. Changes in them should be reverted! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
uschindler commented on code in PR #633: URL: https://github.com/apache/lucene/pull/633#discussion_r862504930 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java: ## @@ -86,6 +86,20 @@ public MergeSpecification findMerges( return mergeSpec; } + @Override + public MergeSpecification findMerges(CodecReader... readers) throws IOException { +if (random.nextBoolean() == true) { + return super.findMerges(readers); +} else { + // create a oneMerge for each reader to let them get concurrently processed by addIndexes() Review Comment: But there should be an additional test that checks if concurrent addIndexes works as expected. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] vigyasharma commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on code in PR #633: URL: https://github.com/apache/lucene/pull/633#discussion_r862515687 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java: ## @@ -86,6 +86,20 @@ public MergeSpecification findMerges( return mergeSpec; } + @Override + public MergeSpecification findMerges(CodecReader... readers) throws IOException { +if (random.nextBoolean() == true) { + return super.findMerges(readers); +} else { + // create a oneMerge for each reader to let them get concurrently processed by addIndexes() Review Comment: Thanks for reviewing this @rmuir, @uschindler. I understand your concern. Failures on concurrent code are already tough to reproduce; it gets much harder to do this from a randomized test seed. It is worth noting that the `findMerges(CodecReader[])` API modified in `MockRandomMergePolicy`, is only invoked within the `addIndexes(CodecReaders[])` API, which should contain its scope of impact. But I definitely don't want to add noise with random failures on distantly related tests. I'll remove this new policy from `MockRandomMergePolicy`, which is frequently used for `IndexWriterConfig` in multiple tests. For my awareness, what is the best practice in Lucene for testing non-default, concurrency related code paths? Randomized tests help validate infrequently triggered code, but concurrency poses has reproducibility challenges. For such cases, do we try to rely entirely on change-specific concurrency unit tests? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] vigyasharma commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on code in PR #633: URL: https://github.com/apache/lucene/pull/633#discussion_r862515847 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java: ## @@ -86,6 +86,20 @@ public MergeSpecification findMerges( return mergeSpec; } + @Override + public MergeSpecification findMerges(CodecReader... readers) throws IOException { +if (random.nextBoolean() == true) { + return super.findMerges(readers); +} else { + // create a oneMerge for each reader to let them get concurrently processed by addIndexes() Review Comment: > Please add a specific test only for concurrently merging addIndexes, but not make it the default for all uses of IndexWriter in unit tests. I have specific tests for this change - to test concurrent execution [[1]](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R741), rollbacks and transactional behavior [[2]](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R790), boundary conditions [[3]](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R820), [[4]](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R838), [[5]](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R882), [[6]](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R926), triggering background merges [[7]](https://github.com/apache/lucene/pull/633/files#diff- ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R917), and changes in concurrency tests that check for IW abort, rollback and competing concurrent operations [[8]](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R993-R994). Would love to get your inputs on any other scenarios that I should test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] vigyasharma commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on code in PR #633: URL: https://github.com/apache/lucene/pull/633#discussion_r862516293 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java: ## @@ -86,6 +86,20 @@ public MergeSpecification findMerges( return mergeSpec; } + @Override + public MergeSpecification findMerges(CodecReader... readers) throws IOException { +if (random.nextBoolean() == true) { + return super.findMerges(readers); +} else { + // create a oneMerge for each reader to let them get concurrently processed by addIndexes() Review Comment: > Perhaps create specific tests for the ones that actually failed when this was enabled for all tests? > ...So there is definitely nothing wrong with those tests. Changes in them should be reverted! The only test that failed was `testAddIndicesWithSoftDeletes()`. It tests two different scenarios of soft deletes. One with `addIndexes(readers)` where it validates the no. of soft deletes after adding readers to an empty index. The other is with `addIndexes(wrappedReaders)` where `wrappedReaders` filter out the soft deletes, and so the resulting writer has `maxDoc == wrappedReader.numDocs()`. The test was using `writer.cloneSegmentInfos().info(0)` to count number of soft deletes in resulting index for scenario-1. This requires the index to be empty and for only a single segment to get added. I believe new segments get added at the end of the list. If the intent here is to ensure that only one segment is added, maybe we should make it something like `writer.cloneSegmentInfos().info(numSegments - 1)`? Or separately assert on the no. of new segments added? It does seem outside the purpose of this test, which is to check for soft delete retention. I [changed](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R1796-R1800) it to use a sum of softDeleteCount from every segment, which, per my understanding, is also what [`writer.getDocStats()`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L6192) does. The test passes with both concurrent and single merge addIndexes, with this change. Let me know if I should revert this to its former state. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
uschindler commented on code in PR #633: URL: https://github.com/apache/lucene/pull/633#discussion_r862518550 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java: ## @@ -86,6 +86,20 @@ public MergeSpecification findMerges( return mergeSpec; } + @Override + public MergeSpecification findMerges(CodecReader... readers) throws IOException { +if (random.nextBoolean() == true) { + return super.findMerges(readers); +} else { + // create a oneMerge for each reader to let them get concurrently processed by addIndexes() Review Comment: This single test looks still fine. You talked of multiple tests, so I was unsure which ones were changed because of your statement. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader
zacharymorn commented on PR #833: URL: https://github.com/apache/lucene/pull/833#issuecomment-111430 > It's great that we're tackling this! I can't see this really hurting performance (since the checks are simple and not frequent), but am curious what you think. I don't think we benchmark `ExitableDirectoryReader` in luceneutil. Thanks for the review and comment! Hmm that's a good question. I think it should have similar performance characteristics to other methods in `ExitableDirectoryReader` given the checking logic is similar, but I'm not sure if NN methods in general might be invoked more often than those to achieve their intended operations / use cases in real applications? Other than the lack of luceneutil tests for this, there's no baseline either since this is new. I guess we could release it first and then iterate based on user feedback? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] vigyasharma commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on code in PR #633: URL: https://github.com/apache/lucene/pull/633#discussion_r862565471 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java: ## @@ -86,6 +86,20 @@ public MergeSpecification findMerges( return mergeSpec; } + @Override + public MergeSpecification findMerges(CodecReader... readers) throws IOException { +if (random.nextBoolean() == true) { + return super.findMerges(readers); +} else { + // create a oneMerge for each reader to let them get concurrently processed by addIndexes() Review Comment: Ah, my bad. I'll update that comment. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org