[jira] [Commented] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API

2022-05-01 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530504#comment-17530504
 ] 

Robert Muir commented on LUCENE-10216:
--

{quote}
I've also modified the MockRandomMergePolicy to randomly pick a highly 
concurrent, (one segment per reader), findMerges(...) implementation 50% of the 
time.
{quote}

Is this going to make 50% of our test runs non-reproducible? Please, let's 
consider the need to reproduce failing runs and not pick such highly concurrent 
stuff for the unit tests. We should be able to test it another way.

> Add concurrency to addIndexes(CodecReader…) API
> ---
>
> Key: LUCENE-10216
> URL: https://issues.apache.org/jira/browse/LUCENE-10216
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Vigya Sharma
>Priority: Major
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> I work at Amazon Product Search, and we use Lucene to power search for the 
> e-commerce platform. I’m working on a project that involves applying 
> metadata+ETL transforms and indexing documents on n different _indexing_ 
> boxes, combining them into a single index on a separate _reducer_ box, and 
> making it available for queries on m different _search_ boxes (replicas). 
> Segments are asynchronously copied from indexers to reducers to searchers as 
> they become available for the next layer to consume.
> I am using the addIndexes API to combine multiple indexes into one on the 
> reducer boxes. Since we also have taxonomy data, we need to remap facet field 
> ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version 
> of this API. The API leverages {{SegmentMerger.merge()}} to create segments 
> with new ordinal values while also merging all provided segments in the 
> process.
> _This is however a blocking call that runs in a single thread._ Until we have 
> written segments with new ordinal values, we cannot copy them to searcher 
> boxes, which increases the time to make documents available for search.
> I was playing around with the API by creating multiple concurrent merges, 
> each with only a single reader, creating a concurrently running 1:1 
> conversion from old segments to new ones (with new ordinal values). We follow 
> this up with non-blocking background merges. This lets us copy the segments 
> to searchers and replicas as soon as they are available, and later replace 
> them with merged segments as background jobs complete. On the Amazon dataset 
> I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. 
> Each call was given about 5 readers to add on average.
> This might be useful add to Lucene. We could create another {{addIndexes()}} 
> API with a {{boolean}} flag for concurrency, that internally submits multiple 
> merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, 
> and waits for them to complete before returning.
> While this is doable from outside Lucene by using your thread pool, starting 
> multiple addIndexes() calls and waiting for them to complete, I felt it needs 
> some understanding of what addIndexes does, why you need to wait on the merge 
> and why it makes sense to pass a single reader in the addIndexes API.
> Out of box support in Lucene could simplify this for folks a similar use case.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-05-01 Thread GitBox


rmuir commented on code in PR #633:
URL: https://github.com/apache/lucene/pull/633#discussion_r862451702


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java:
##
@@ -86,6 +86,20 @@ public MergeSpecification findMerges(
 return mergeSpec;
   }
 
+  @Override
+  public MergeSpecification findMerges(CodecReader... readers) throws 
IOException {
+if (random.nextBoolean() == true) {
+  return super.findMerges(readers);
+} else {
+  // create a oneMerge for each reader to let them get concurrently 
processed by addIndexes()

Review Comment:
   Can we do this much much less often, possibly only at nightly, or not at all?
   
   It is very easy to transition from "repeatable randomized tests" to "chaotic 
flaky non-reproducing madness" by adding a bunch of concurrency everywhere. 
It's not the way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-05-01 Thread GitBox


uschindler commented on code in PR #633:
URL: https://github.com/apache/lucene/pull/633#discussion_r862455421


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java:
##
@@ -86,6 +86,20 @@ public MergeSpecification findMerges(
 return mergeSpec;
   }
 
+  @Override
+  public MergeSpecification findMerges(CodecReader... readers) throws 
IOException {
+if (random.nextBoolean() == true) {
+  return super.findMerges(readers);
+} else {
+  // create a oneMerge for each reader to let them get concurrently 
processed by addIndexes()

Review Comment:
   Yes please don't do this for all tests. Not even nightly.
   
   Please add a specific test only for concurrently merging addindexes, bit not 
make it the default for all uses of IndexWriter in unit tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-05-01 Thread GitBox


uschindler commented on code in PR #633:
URL: https://github.com/apache/lucene/pull/633#discussion_r862455421


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java:
##
@@ -86,6 +86,20 @@ public MergeSpecification findMerges(
 return mergeSpec;
   }
 
+  @Override
+  public MergeSpecification findMerges(CodecReader... readers) throws 
IOException {
+if (random.nextBoolean() == true) {
+  return super.findMerges(readers);
+} else {
+  // create a oneMerge for each reader to let them get concurrently 
processed by addIndexes()

Review Comment:
   Yes please don't do this for all tests. Not even nightly.
   
   Please add a specific test only for concurrently merging addIndexes, but not 
make it the default for all uses of IndexWriter in unit tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-05-01 Thread GitBox


msokolov commented on code in PR #633:
URL: https://github.com/apache/lucene/pull/633#discussion_r862485790


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java:
##
@@ -86,6 +86,20 @@ public MergeSpecification findMerges(
 return mergeSpec;
   }
 
+  @Override
+  public MergeSpecification findMerges(CodecReader... readers) throws 
IOException {
+if (random.nextBoolean() == true) {
+  return super.findMerges(readers);
+} else {
+  // create a oneMerge for each reader to let them get concurrently 
processed by addIndexes()

Review Comment:
   Perhaps create specific tests for the ones that actually failed when this 
was enabled for all tests?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-05-01 Thread GitBox


uschindler commented on code in PR #633:
URL: https://github.com/apache/lucene/pull/633#discussion_r862504146


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java:
##
@@ -86,6 +86,20 @@ public MergeSpecification findMerges(
 return mergeSpec;
   }
 
+  @Override
+  public MergeSpecification findMerges(CodecReader... readers) throws 
IOException {
+if (random.nextBoolean() == true) {
+  return super.findMerges(readers);
+} else {
+  // create a oneMerge for each reader to let them get concurrently 
processed by addIndexes()

Review Comment:
   I don't think those tests are wrong. The tests are there to test specific 
merging behaviour. If they only expect one segment they should only get one 
segment. Those tests should not suddenly get a concurrent addIndexes behaviour. 
That's not what they assert. So there is definitely nothing wrong with those 
tests. Changes in them should be reverted!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-05-01 Thread GitBox


uschindler commented on code in PR #633:
URL: https://github.com/apache/lucene/pull/633#discussion_r862504930


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java:
##
@@ -86,6 +86,20 @@ public MergeSpecification findMerges(
 return mergeSpec;
   }
 
+  @Override
+  public MergeSpecification findMerges(CodecReader... readers) throws 
IOException {
+if (random.nextBoolean() == true) {
+  return super.findMerges(readers);
+} else {
+  // create a oneMerge for each reader to let them get concurrently 
processed by addIndexes()

Review Comment:
   But there should be an additional test that checks if concurrent addIndexes 
works as expected.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] vigyasharma commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-05-01 Thread GitBox


vigyasharma commented on code in PR #633:
URL: https://github.com/apache/lucene/pull/633#discussion_r862515687


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java:
##
@@ -86,6 +86,20 @@ public MergeSpecification findMerges(
 return mergeSpec;
   }
 
+  @Override
+  public MergeSpecification findMerges(CodecReader... readers) throws 
IOException {
+if (random.nextBoolean() == true) {
+  return super.findMerges(readers);
+} else {
+  // create a oneMerge for each reader to let them get concurrently 
processed by addIndexes()

Review Comment:
   Thanks for reviewing this @rmuir, @uschindler. I understand your concern. 
Failures on concurrent code are already tough to reproduce; it gets much harder 
to do this from a randomized test seed. 
   
   It is worth noting that the `findMerges(CodecReader[])` API modified in 
`MockRandomMergePolicy`, is only invoked within the 
`addIndexes(CodecReaders[])` API, which should contain its scope of impact. But 
I definitely don't want to add noise with random failures on distantly related 
tests. I'll remove this new policy from `MockRandomMergePolicy`, which is 
frequently used for `IndexWriterConfig` in multiple tests.
   
   For my awareness, what is the best practice in Lucene for testing 
non-default, concurrency related code paths? Randomized tests help validate 
infrequently triggered code, but concurrency poses has reproducibility 
challenges. For such cases, do we try to rely entirely on change-specific 
concurrency unit tests? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] vigyasharma commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-05-01 Thread GitBox


vigyasharma commented on code in PR #633:
URL: https://github.com/apache/lucene/pull/633#discussion_r862515847


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java:
##
@@ -86,6 +86,20 @@ public MergeSpecification findMerges(
 return mergeSpec;
   }
 
+  @Override
+  public MergeSpecification findMerges(CodecReader... readers) throws 
IOException {
+if (random.nextBoolean() == true) {
+  return super.findMerges(readers);
+} else {
+  // create a oneMerge for each reader to let them get concurrently 
processed by addIndexes()

Review Comment:
   > Please add a specific test only for concurrently merging addIndexes, but 
not make it the default for all uses of IndexWriter in unit tests.
   
   I have specific tests for this change - to test concurrent execution 
[[1]](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R741),
 rollbacks and transactional behavior 
[[2]](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R790),
 boundary conditions 
[[3]](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R820),
 
[[4]](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R838),
 
[[5]](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R882),
 
[[6]](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R926),
  triggering background merges 
[[7]](https://github.com/apache/lucene/pull/633/files#diff-
 ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R917), and 
changes in concurrency tests that check for IW abort, rollback and competing 
concurrent operations 
[[8]](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R993-R994).
 
   
   Would love to get your inputs on any other scenarios that I should test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] vigyasharma commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-05-01 Thread GitBox


vigyasharma commented on code in PR #633:
URL: https://github.com/apache/lucene/pull/633#discussion_r862516293


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java:
##
@@ -86,6 +86,20 @@ public MergeSpecification findMerges(
 return mergeSpec;
   }
 
+  @Override
+  public MergeSpecification findMerges(CodecReader... readers) throws 
IOException {
+if (random.nextBoolean() == true) {
+  return super.findMerges(readers);
+} else {
+  // create a oneMerge for each reader to let them get concurrently 
processed by addIndexes()

Review Comment:
   > Perhaps create specific tests for the ones that actually failed when this 
was enabled for all tests?
   
   > ...So there is definitely nothing wrong with those tests. Changes in them 
should be reverted!
   
   
   The only test that failed was `testAddIndicesWithSoftDeletes()`. It tests 
two different scenarios of soft deletes. One with `addIndexes(readers)` where 
it validates the no. of soft deletes after adding readers to an empty index. 
   The other is with `addIndexes(wrappedReaders)` where `wrappedReaders` filter 
out the soft deletes, and so the resulting writer has `maxDoc == 
wrappedReader.numDocs()`.
   
   The test was using `writer.cloneSegmentInfos().info(0)` to count number of 
soft deletes in resulting index for scenario-1. This requires the index to be 
empty and for only a single segment to get added.  I believe new segments get 
added at the end of the list. If the intent here is to ensure that only one 
segment is added, maybe we should make it something like 
`writer.cloneSegmentInfos().info(numSegments - 1)`? Or separately assert on the 
no. of new segments added? It does seem outside the purpose of this test, which 
is to check for soft delete retention.
   
   I 
[changed](https://github.com/apache/lucene/pull/633/files#diff-ce6b91c31190c4956eee8a734550af25047d33b92a71c3d7303eac4cc408c504R1796-R1800)
 it to use a sum of softDeleteCount from every segment, which, per my 
understanding, is also what 
[`writer.getDocStats()`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L6192)
 does. The test passes with both concurrent and single merge addIndexes, with 
this change. Let me know if I should revert this to its former state.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-05-01 Thread GitBox


uschindler commented on code in PR #633:
URL: https://github.com/apache/lucene/pull/633#discussion_r862518550


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java:
##
@@ -86,6 +86,20 @@ public MergeSpecification findMerges(
 return mergeSpec;
   }
 
+  @Override
+  public MergeSpecification findMerges(CodecReader... readers) throws 
IOException {
+if (random.nextBoolean() == true) {
+  return super.findMerges(readers);
+} else {
+  // create a oneMerge for each reader to let them get concurrently 
processed by addIndexes()

Review Comment:
   This single test looks still fine. You talked of multiple tests, so I was 
unsure which ones were changed because of your statement.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zacharymorn commented on pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader

2022-05-01 Thread GitBox


zacharymorn commented on PR #833:
URL: https://github.com/apache/lucene/pull/833#issuecomment-111430

   > It's great that we're tackling this! I can't see this really hurting 
performance (since the checks are simple and not frequent), but am curious what 
you think. I don't think we benchmark `ExitableDirectoryReader` in luceneutil.
   
   Thanks for the review and comment! Hmm that's a good question. I think it 
should have similar performance characteristics to other methods in 
`ExitableDirectoryReader` given the checking logic is similar, but I'm not sure 
if NN methods in general might be invoked more often than those to achieve 
their intended operations / use cases in real applications? Other than the lack 
of luceneutil tests for this, there's no baseline either since this is new. I 
guess we could release it first and then iterate based on user feedback?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] vigyasharma commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-05-01 Thread GitBox


vigyasharma commented on code in PR #633:
URL: https://github.com/apache/lucene/pull/633#discussion_r862565471


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/MockRandomMergePolicy.java:
##
@@ -86,6 +86,20 @@ public MergeSpecification findMerges(
 return mergeSpec;
   }
 
+  @Override
+  public MergeSpecification findMerges(CodecReader... readers) throws 
IOException {
+if (random.nextBoolean() == true) {
+  return super.findMerges(readers);
+} else {
+  // create a oneMerge for each reader to let them get concurrently 
processed by addIndexes()

Review Comment:
   Ah, my bad. I'll update that comment. Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org