[GitHub] [lucene] gf2121 commented on a diff in pull request #12574: Make TaskExecutor public
gf2121 commented on code in PR #12574: URL: https://github.com/apache/lucene/pull/12574#discussion_r1331468750 ## lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java: ## @@ -85,11 +85,24 @@ final List invokeAll(Collection> tasks) throws IOException { return results; } - final Task createTask(Callable callable) { + /** + * Creates a task given the provided {@link Callable} + * + * @param callable the callable to be executed as part of the task + * @return the created task + * @param the return type of the task + */ + public Task createTask(Callable callable) { return new Task<>(callable); } - static class Task extends FutureTask { + /** + * Extension of {@link FutureTask} that tracks the number of tasks that are running in each + * thread. + * + * @param the return tyupe of the task Review Comment: ```suggestion * @param the return type of the task ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gf2121 opened a new pull request, #12573: Speed up sort on deleted terms
gf2121 opened a new pull request, #12573: URL: https://github.com/apache/lucene/pull/12573 ### Description Recently, we captured a flame graph in a scene with frequent updates, which showed that sorting deleted terms occupied a high CPU ratio. Currently, we use JDK sort to sort deleted terms, compare Term with Term#field, and then compare Term#bytes : ``` @Override public final int compareTo(Term other) { if (field.equals(other.field)) { return bytes.compareTo(other.bytes); } else { return field.compareTo(other.field); } } ``` In scenarios with many deleted terms, most deleted terms have the same field name. So a data structure like `Map>` instead of `Map` could be better here —— We can avoid field name compare and use `MSBRadixSort` to sort the bytes for each field. Further more, we can take advantage of `BytesRefHash` here to implement Map. This can help get a more efficient memory layout, and there's already MSBRadixSort impl in `BytesRefHash` we can reuse. ### Benchmark I run a benchmark that update on field 1,000,000 times, total took decreased 35%. https://bytedance.feishu.cn/sheets/LcVzsNpiphJeqjtIGi5c738Jnyh"; data-importRangeRawData-range="'Sheet1'!A1:C2"> https://bytedance.feishu.cn/sheets/LcVzsNpiphJeqjtIGi5c738Jnyh"; data-importRangeRawData-range="'Sheet1'!A1:D2"> | Baseline(default iw config) | Candidate (default iw config) | Took Diff -- | -- | -- | -- total took | 3402 | 2202 | -35.27% **benchmark code** ``` Directory dir = FSDirectory.open(Paths.get("./test_index")); IndexWriter w = new IndexWriter( dir, new IndexWriterConfig() .setOpenMode(IndexWriterConfig.OpenMode.CREATE) .setMergePolicy(NoMergePolicy.INSTANCE) ); long start = System.currentTimeMillis(); for (int i = 0; i < 100; i++) { Document doc = new Document(); doc.add(new StringField("id", i + "", Field.Store.NO)); Term delTerm = new Term("id", "x" + i); w.updateDocument(delTerm, doc); } long end = System.currentTimeMillis(); System.out.println(end - start); w.commit(); w.close(); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] shubhamvishu commented on pull request #12183: Make TermStates#build concurrent
shubhamvishu commented on PR #12183: URL: https://github.com/apache/lucene/pull/12183#issuecomment-1727638896 > Hey @shubhamvishu heads up: I merged #12659 to address the deadlock issue and opened #12574 to adjust TaskExecutor visibility outside of this PR. Hopefully you are next going to be able to merge your PR! Thanks a lot @javanna! I'll push the new revision post #12574 is merged. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on pull request #12183: Make TermStates#build concurrent
javanna commented on PR #12183: URL: https://github.com/apache/lucene/pull/12183#issuecomment-1727419929 Hey @shubhamvishu heads up: I merged #12659 to address the deadlock issue and opened #12574 to adjust TaskExecutor visibility outside of this PR. Hopefully you are next going to be able to merge your PR! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent closed issue #12570: Reading after Segment Merge fails for HNSW
benwtrent closed issue #12570: Reading after Segment Merge fails for HNSW URL: https://github.com/apache/lucene/issues/12570 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna merged pull request #12574: Make TaskExecutor public
javanna merged PR #12574: URL: https://github.com/apache/lucene/pull/12574 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz opened a new issue, #12572: Make IndexWriter#flushNextBuffer flush deletes too?
jpountz opened a new issue, #12572: URL: https://github.com/apache/lucene/issues/12572 ### Description `IndexWriter#flushNextBuffer()` is a convenient way to control indexing buffer sizes across multiple index writers. Unfortunately, it seems that it only ever flushes DWPTs, and never deletes, even though deletes may use a very significant amount of the indexing buffer. Can we enhance `flushNextBuffer()` to flush deletes too when appropriate? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna opened a new pull request, #12574: Make TaskExecutor public
javanna opened a new pull request, #12574: URL: https://github.com/apache/lucene/pull/12574 TaskExecutor is currently package private. We have scenarios where we want to parallelize the execution and reuse it outside of its package, hence this commit makes it public. Note that its constructor remains package private as it is supposed to be created by the index searcher, and later retrieved from it via the appropriate getter, which is also made public as part of this commit. This is a pre-requisite for #12183 . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih closed issue #9660: ArrayIndexOutOfBoundsException in ByteBlockPool [LUCENE-8614]
zhaih closed issue #9660: ArrayIndexOutOfBoundsException in ByteBlockPool [LUCENE-8614] URL: https://github.com/apache/lucene/issues/9660 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] stefanvodita commented on issue #9660: ArrayIndexOutOfBoundsException in ByteBlockPool [LUCENE-8614]
stefanvodita commented on issue #9660: URL: https://github.com/apache/lucene/issues/9660#issuecomment-1727196480 Yes, it's resolved. Thanks, Patrick! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on pull request #12569: Prevent concurrent tasks from parallelizing further
javanna commented on PR #12569: URL: https://github.com/apache/lucene/pull/12569#issuecomment-1727388144 > It might be worth using CallerRunsPolicy with a small queue in tests sometimes, as this is an interesting case that will make tasks run in the current thread. Given that TaskExecutor runs in the caller thread when the counter is above zero, I was thinking that we already cover the situation where we increment the counter for the caller thread. And that has no effect other than executing in the caller thread any further task. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kaivalnp opened a new issue, #12575: Allow implementers of AbstractKnnVectorQuery to access final topK results?
kaivalnp opened a new issue, #12575: URL: https://github.com/apache/lucene/issues/12575 ### Context Vector search is performed in [`AbstractKnnVectorQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java), where individual HNSW searches are delegated to sub-classes via [`#approximateSearch`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L174). This is useful to implement custom functionality (like say pro-rating `k` across segments for performance, or requesting some additional results over `k` for higher recall with Exact KNN) While the class in itself is `package-private`, we extend the corresponding sub-classes for [byte](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnByteVectorQuery.java) and [float](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java) vectors (which are `public`) for implementing any custom functionality like above ### Issue After searching across all segments, we retain the index-level `topk` results [here](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L88), and immediately call [`#createRewrittenQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L219) to rewrite them as a [`DocAndScoreQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L297) So the implementing classes do not have access to the final `topK` results anywhere, which may be useful to read / modify like a post-process step (for example metric emission, or counting / only keeping results above a threshold) ### Proposal Can we make [`#createRewrittenQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L219) `protected` to allow sub-classes to override it (and ultimately access the `topK` results)? They can simply delegate to the original function at the end, or even implement a custom `Query` if required [Example changes](https://github.com/apache/lucene/commit/f560856626a50ddc5c24d4f3b4d8b8cc4bbfb228) I don't how common of a use-case this is, and wanted to get some opinions -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on a diff in pull request #12574: Make TaskExecutor public
javanna commented on code in PR #12574: URL: https://github.com/apache/lucene/pull/12574#discussion_r1331497358 ## lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java: ## @@ -85,11 +85,24 @@ final List invokeAll(Collection> tasks) throws IOException { return results; } - final Task createTask(Callable callable) { + /** + * Creates a task given the provided {@link Callable} + * + * @param callable the callable to be executed as part of the task + * @return the created task + * @param the return type of the task + */ + public Task createTask(Callable callable) { return new Task<>(callable); } - static class Task extends FutureTask { + /** + * Extension of {@link FutureTask} that tracks the number of tasks that are running in each + * thread. + * + * @param the return tyupe of the task Review Comment: Thanks @gf2121 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on a diff in pull request #12569: Prevent concurrent tasks from parallelizing further
javanna commented on code in PR #12569: URL: https://github.com/apache/lucene/pull/12569#discussion_r1331259326 ## lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java: ## @@ -22,18 +22,29 @@ import java.util.Collection; import java.util.List; import java.util.Objects; +import java.util.concurrent.Callable; import java.util.concurrent.ExecutionException; import java.util.concurrent.Executor; import java.util.concurrent.Future; -import java.util.concurrent.RunnableFuture; +import java.util.concurrent.FutureTask; import org.apache.lucene.util.IOUtils; import org.apache.lucene.util.ThreadInterruptedException; /** * Executor wrapper responsible for the execution of concurrent tasks. Used to parallelize search - * across segments as well as query rewrite in some cases. + * across segments as well as query rewrite in some cases. Exposes a {@link #createTask(Callable)} + * method to create tasks given a {@link Callable}, as well as the {@link #invokeAll(Collection)} + * method to execute a set of tasks concurrently. Once all tasks are submitted to the executor, it + * blocks and wait for all tasks to be completed, and then returns a list with the obtained results. + * Ensures that the underlying executor is only used for top-level {@link #invokeAll(Collection)} + * calls, and not for potential {@link #invokeAll(Collection)} calls made from one of the tasks. + * This is to prevent deadlock with certain types of executors, as well as to limit the level of + * parallelism. */ class TaskExecutor { + // a static thread local is ok as long as there is a single TaskExecutor ever created Review Comment: Thanks all for the feedback, I added a test that uses two index searcher instances with a shared executor. This would have failed with the previous boolean approach, and succeeds with the int counter. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on a diff in pull request #12569: Prevent concurrent tasks from parallelizing further
javanna commented on code in PR #12569: URL: https://github.com/apache/lucene/pull/12569#discussion_r1331266192 ## lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java: ## @@ -22,18 +22,28 @@ import java.util.Collection; import java.util.List; import java.util.Objects; +import java.util.concurrent.Callable; import java.util.concurrent.ExecutionException; import java.util.concurrent.Executor; import java.util.concurrent.Future; -import java.util.concurrent.RunnableFuture; +import java.util.concurrent.FutureTask; import org.apache.lucene.util.IOUtils; import org.apache.lucene.util.ThreadInterruptedException; /** * Executor wrapper responsible for the execution of concurrent tasks. Used to parallelize search - * across segments as well as query rewrite in some cases. + * across segments as well as query rewrite in some cases. Exposes a {@link #createTask(Callable)} + * method to create tasks given a {@link Callable}, as well as the {@link #invokeAll(Collection)} + * method to execute a set of tasks concurrently. Once all tasks are submitted to the executor, it + * blocks and wait for all tasks to be completed, and then returns a list with the obtained results. + * Ensures that the underlying executor is only used for top-level {@link #invokeAll(Collection)} + * calls, and not for potential {@link #invokeAll(Collection)} calls made from one of the tasks. + * This is to prevent deadlock with certain types of executors, as well as to limit the level of + * parallelism. */ class TaskExecutor { Review Comment: It's all package private so far. We can change visibility to public (for the class and getter, but not the constructor) as a follow-up -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on issue #9660: ArrayIndexOutOfBoundsException in ByteBlockPool [LUCENE-8614]
zhaih commented on issue #9660: URL: https://github.com/apache/lucene/issues/9660#issuecomment-1727052393 @stefanvodita Seems the issue is resolved? I closed the issue, feel free to reopen it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna merged pull request #12569: Prevent concurrent tasks from parallelizing further
javanna merged PR #12569: URL: https://github.com/apache/lucene/pull/12569 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] vsop-479 commented on pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.
vsop-479 commented on PR #12528: URL: https://github.com/apache/lucene/pull/12528#issuecomment-1727215669 @jpountz Please take a look when you get a chance! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] shubhamvishu commented on pull request #12183: Make TermStates#build concurrent
shubhamvishu commented on PR #12183: URL: https://github.com/apache/lucene/pull/12183#issuecomment-1727782959 I have rebased the PR based on the changes in #12574. Could some please take a look? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on pull request #12569: Prevent concurrent tasks from parallelizing further
javanna commented on PR #12569: URL: https://github.com/apache/lucene/pull/12569#issuecomment-172724 I pushed new commits to address the latest review comments, thanks for all the input. This should be ready now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] tylerbertrand opened a new pull request, #12577: Resolve CompileJava task cache miss
tylerbertrand opened a new pull request, #12577: URL: https://github.com/apache/lucene/pull/12577 ### Description Resolves `CompileJava` cache miss caused by `options.compilerArgs` input difference. Moved the `apijar` input file to a `CommandLineArgumentProvider` to apply relative path sensitivity, so that changes in the absolute path of the file do not cause a cache miss. [Task input comparison without fixes](https://ge.solutions-team.gradle.com/c/ndkin5oemsjac/wfrzfdtcomsj6/task-inputs?cacheability=cacheable,overlapping-outputs,validation-failure) shows `options.compilerArgs` input differences causing the `CompileJava` tasks to be needlessly re-executed. [Task input comparison with cache fixes applied](https://ge.solutions-team.gradle.com/c/uwmxltj3pshhe/alpaezxmm5isc/task-inputs?cacheability=cacheable,overlapping-outputs,validation-failure) shows no input differences. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna opened a new pull request, #12578: Deprecate IndexSearcher#getExecutor
javanna opened a new pull request, #12578: URL: https://github.com/apache/lucene/pull/12578 We have recently introduced a TaskExecutor abstraction, which is meant to be used to execute concurrent tasks using the executor provided to the IndexSearcher constructor. All concurrenct tasks should be executed using the TaskExecutor, and there shoul be no reason to retrieve the Executor directly from the IndexSearcher. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #12569: Prevent concurrent tasks from parallelizing further
uschindler commented on code in PR #12569: URL: https://github.com/apache/lucene/pull/12569#discussion_r1331117098 ## lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java: ## @@ -22,18 +22,29 @@ import java.util.Collection; import java.util.List; import java.util.Objects; +import java.util.concurrent.Callable; import java.util.concurrent.ExecutionException; import java.util.concurrent.Executor; import java.util.concurrent.Future; -import java.util.concurrent.RunnableFuture; +import java.util.concurrent.FutureTask; import org.apache.lucene.util.IOUtils; import org.apache.lucene.util.ThreadInterruptedException; /** * Executor wrapper responsible for the execution of concurrent tasks. Used to parallelize search - * across segments as well as query rewrite in some cases. + * across segments as well as query rewrite in some cases. Exposes a {@link #createTask(Callable)} + * method to create tasks given a {@link Callable}, as well as the {@link #invokeAll(Collection)} + * method to execute a set of tasks concurrently. Once all tasks are submitted to the executor, it + * blocks and wait for all tasks to be completed, and then returns a list with the obtained results. + * Ensures that the underlying executor is only used for top-level {@link #invokeAll(Collection)} + * calls, and not for potential {@link #invokeAll(Collection)} calls made from one of the tasks. + * This is to prevent deadlock with certain types of executors, as well as to limit the level of + * parallelism. */ class TaskExecutor { + // a static thread local is ok as long as there is a single TaskExecutor ever created Review Comment: Yes, static is fine. You only have to be careful with multiple index searchers all using a single ThreadPool instance (e.g., passed to many shards, each having their own IndexSearcher). So multiple IndexSearchers would share same thread pool, although all have a different instance of TaskExecutor. To workaround possible inconsistencies, I suggest to use (it was already implemented by @javanna) to increment/decrement instead of a binary switch, because it could happen that a task could execute another IndexSearcher method (e.g. also from another seracher of a different index, but using the same thread pool, e.g. to execute a paralelizzed join between indexes). If we only have a boolean instead of an integer, the state of the threadlocal after returning from the inner call would be false (instead true), so another call to s subjob would then parallize. By using a counter (as suggested above and already implemented) we are safe for such usages. Each task increments counter and decrements when done. In addition a counter would help us easily to implement parallelism by allowing the counter to go >1 before we switch to serial. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on a diff in pull request #12569: Prevent concurrent tasks from parallelizing further
javanna commented on code in PR #12569: URL: https://github.com/apache/lucene/pull/12569#discussion_r1331266192 ## lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java: ## @@ -22,18 +22,28 @@ import java.util.Collection; import java.util.List; import java.util.Objects; +import java.util.concurrent.Callable; import java.util.concurrent.ExecutionException; import java.util.concurrent.Executor; import java.util.concurrent.Future; -import java.util.concurrent.RunnableFuture; +import java.util.concurrent.FutureTask; import org.apache.lucene.util.IOUtils; import org.apache.lucene.util.ThreadInterruptedException; /** * Executor wrapper responsible for the execution of concurrent tasks. Used to parallelize search - * across segments as well as query rewrite in some cases. + * across segments as well as query rewrite in some cases. Exposes a {@link #createTask(Callable)} + * method to create tasks given a {@link Callable}, as well as the {@link #invokeAll(Collection)} + * method to execute a set of tasks concurrently. Once all tasks are submitted to the executor, it + * blocks and wait for all tasks to be completed, and then returns a list with the obtained results. + * Ensures that the underlying executor is only used for top-level {@link #invokeAll(Collection)} + * calls, and not for potential {@link #invokeAll(Collection)} calls made from one of the tasks. + * This is to prevent deadlock with certain types of executors, as well as to limit the level of + * parallelism. */ class TaskExecutor { Review Comment: It's all package private so far. We can change visibility to public (for the class and getter, but not the constructor) when needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] kaivalnp opened a new issue, #12579: [DISCUSS] Should there be a threshold-based vector search API?
kaivalnp opened a new issue, #12579: URL: https://github.com/apache/lucene/issues/12579 ### Context Almost all [vector search algorithms](https://ann-benchmarks.com/index.html#algorithms) focus on getting the `topK` results for a given query vector. This however, may not be the best objective to pursue because: - Some queries are more suited for vector search: Indexed vectors may not be uniformly distributed, with some parts of the vector space denser than others. If the query is present in a sparse part of the space, the `topK` results will be low scoring (and not relevant to the query). We may want some kind of score `threshold`, below which a vector is not considered as a "match"? - Conversely, if the query is present in a dense part of the vector space, we may not want to limit vector search to `topK` when we have some more (slightly lower scoring) results, but still above the `threshold`. Considering a query and a vector as a "match" if their similarity score is above the `threshold`, we may want ALL possible matches? I found the same thing being discussed in #11946 as well.. However, I'm not aware of any algorithms designed for this objective (or how common of a use-case this may be) ### Proposal **IF** this turns out to be a valid use-case, HNSW may not be the best algorithm for this new objective (as it was designed for getting the K-Nearest Neighbors). On a high-level, the current algorithm is: - Maintain a queue of `topK` results (initially empty) and an unbounded queue of candidates (initially containing the entry node). Both of these are priority queues in descending order of similarity scores - Starting from the best (highest scoring) candidate, we insert it into the result queue if there are (1) fewer than `topK` results (2) it is better scoring than any existing result (also removing the least scoring result in this case) - If a node is added as a result, consider all its neighbors as further candidates - The search stops when the best scoring candidate is worse than the least scoring result For the new objective, the underlying graphs created by HNSW may still be useful - where each node is connected to its nearest (and diverse) neighbors, so we have some sense of direction / ordering on which node to explore further Can we instead change the graph traversal (and result collection) criteria such that: - Consider a node as a candidate if it exceeds a lower, graph-traversal specific score threshold - Consider a candidate as a result if it exceeds the actual, query-result similarity score threshold The lower graph-traversal threshold acts as a buffer, to retain results where all nodes along its path may not be above the actual `threshold` (and is treated as a search-time hyper-parameter, chosen according to the underlying vector space and queries) The upper levels of search will remain the same (find the single-best entry point for the next layer) and this new criteria is only applicable to the lowest layer (with all nodes) to collect an unbounded number of results above the `threshold` This will also eliminate the need to maintain results and candidates as priority queues, instead keeping them as simple lists (reducing the overhead of `heapify` calls that maintain the min-max heap property) and just sort them once at the end **Looking for inputs on both the use-case and tweaked algorithm..** -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on a diff in pull request #12577: Resolve CompileJava task cache miss
dweiss commented on code in PR #12577: URL: https://github.com/apache/lucene/pull/12577#discussion_r1331954883 ## gradle/java/core-mrjar.gradle: ## @@ -29,20 +29,19 @@ configure(project(":lucene:core")) { dependencies.add("main${jdkVersion}Implementation", sourceSets.main.output) tasks.named("compileMain${jdkVersion}Java").configure { -def apijar = new File(apijars, "jdk${jdkVersion}.apijar") - -inputs.file(apijar) +def apijar = layout.projectDirectory.file("src/generated/jdk/jdk${jdkVersion}.apijar") Review Comment: this was previously relative to 'apijars' variable; now it's fixed under src/generated... - is this intentional? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on pull request #12578: Deprecate IndexSearcher#getExecutor
javanna commented on PR #12578: URL: https://github.com/apache/lucene/pull/12578#issuecomment-1728197364 > I assume you will remove the deprecated method in main branch and add an entry to the MIGRATE.txt there? Yes. but I'll open a PR against main to have people double check the change. You'll get summoned there too ;) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] tylerbertrand commented on a diff in pull request #12577: Resolve CompileJava task cache miss
tylerbertrand commented on code in PR #12577: URL: https://github.com/apache/lucene/pull/12577#discussion_r1332210752 ## gradle/java/core-mrjar.gradle: ## @@ -29,20 +29,19 @@ configure(project(":lucene:core")) { dependencies.add("main${jdkVersion}Implementation", sourceSets.main.output) tasks.named("compileMain${jdkVersion}Java").configure { -def apijar = new File(apijars, "jdk${jdkVersion}.apijar") - -inputs.file(apijar) +def apijar = layout.projectDirectory.file("src/generated/jdk/jdk${jdkVersion}.apijar") Review Comment: Sort of - the lazy file API (`RegularFile` in this case) doesn't mix well with the normal, "eager" `File` type. Since other things use `apijars`, I didn't want to have these changes cascade elsewhere in the project by updating that to use the lazy file API. Turns out that was a simple update to make, so I've just pushed a commit making that update, and now the path of `apijar` is relative to the `apijars` variable once again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] tylerbertrand commented on a diff in pull request #12577: Resolve CompileJava task cache miss
tylerbertrand commented on code in PR #12577: URL: https://github.com/apache/lucene/pull/12577#discussion_r1332210752 ## gradle/java/core-mrjar.gradle: ## @@ -29,20 +29,19 @@ configure(project(":lucene:core")) { dependencies.add("main${jdkVersion}Implementation", sourceSets.main.output) tasks.named("compileMain${jdkVersion}Java").configure { -def apijar = new File(apijars, "jdk${jdkVersion}.apijar") - -inputs.file(apijar) +def apijar = layout.projectDirectory.file("src/generated/jdk/jdk${jdkVersion}.apijar") Review Comment: Sort of - Gradle's lazy file API (`RegularFile` class in this case) doesn't mix well with the normal, "eager" `File` type. Since other things use `apijars`, I didn't want to have these changes cascade elsewhere in the project by updating that to use the lazy file API. Turns out that was a simple update to make, so I've just pushed a commit making that update, and now the path of `apijar` is relative to the `apijars` variable once again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org