[GitHub] [lucene] gf2121 commented on a diff in pull request #12574: Make TaskExecutor public

2023-09-20 Thread via GitHub


gf2121 commented on code in PR #12574:
URL: https://github.com/apache/lucene/pull/12574#discussion_r1331468750


##
lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java:
##
@@ -85,11 +85,24 @@ final  List invokeAll(Collection> tasks) 
throws IOException {
 return results;
   }
 
-  final  Task createTask(Callable callable) {
+  /**
+   * Creates a task given the provided {@link Callable}
+   *
+   * @param callable the callable to be executed as part of the task
+   * @return the created task
+   * @param  the return type of the task
+   */
+  public  Task createTask(Callable callable) {
 return new Task<>(callable);
   }
 
-  static class Task extends FutureTask {
+  /**
+   * Extension of {@link FutureTask} that tracks the number of tasks that are 
running in each
+   * thread.
+   *
+   * @param  the return tyupe of the task

Review Comment:
   ```suggestion
  * @param  the return type of the task
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gf2121 opened a new pull request, #12573: Speed up sort on deleted terms

2023-09-20 Thread via GitHub


gf2121 opened a new pull request, #12573:
URL: https://github.com/apache/lucene/pull/12573

   ### Description
   
   Recently, we captured a flame graph in a scene with frequent updates, which 
showed that sorting deleted terms occupied a high CPU ratio.
   
   Currently, we use JDK sort to sort deleted terms, compare Term with 
Term#field, and then compare Term#bytes :
   ```
   @Override
   public final int compareTo(Term other) {
 if (field.equals(other.field)) {
   return bytes.compareTo(other.bytes);
 } else {
   return field.compareTo(other.field);
 }
   }
   ```
   
   In scenarios with many deleted terms, most deleted terms have the same field 
name. So a data structure like `Map>` instead of 
`Map` could be better here —— We can avoid field name compare 
and use `MSBRadixSort` to sort the bytes for each field.
   
   Further more, we can take advantage of `BytesRefHash` here to implement 
Map. This can help get a more efficient memory layout, and 
there's already MSBRadixSort impl in `BytesRefHash` we can reuse.
   
   ### Benchmark
   
   I run a benchmark that update on field 1,000,000 times, total took decreased 
35%.
   
   https://bytedance.feishu.cn/sheets/LcVzsNpiphJeqjtIGi5c738Jnyh";
 data-importRangeRawData-range="'Sheet1'!A1:C2">
   
   https://bytedance.feishu.cn/sheets/LcVzsNpiphJeqjtIGi5c738Jnyh";
 data-importRangeRawData-range="'Sheet1'!A1:D2">
   
     | Baseline(default iw config) | Candidate (default iw config) | Took Diff
   -- | -- | -- | --
   total took | 3402 | 2202 | -35.27%
   
   
   
   
   
   
   
   **benchmark code**
   
   ```
   Directory dir =
   FSDirectory.open(Paths.get("./test_index"));
   IndexWriter w =
   new IndexWriter(
   dir,
   new IndexWriterConfig()
   .setOpenMode(IndexWriterConfig.OpenMode.CREATE)
   .setMergePolicy(NoMergePolicy.INSTANCE)
   );
   
   long start = System.currentTimeMillis();
   for (int i = 0; i < 100; i++) {
 Document doc = new Document();
 doc.add(new StringField("id", i + "", Field.Store.NO));
 Term delTerm = new Term("id", "x" + i);
 w.updateDocument(delTerm, doc);
   }
   long end = System.currentTimeMillis();
   System.out.println(end - start);
   
   w.commit();
   w.close();
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] shubhamvishu commented on pull request #12183: Make TermStates#build concurrent

2023-09-20 Thread via GitHub


shubhamvishu commented on PR #12183:
URL: https://github.com/apache/lucene/pull/12183#issuecomment-1727638896

   > Hey @shubhamvishu heads up: I merged #12659 to address the deadlock issue 
and opened #12574 to adjust TaskExecutor visibility outside of this PR. 
Hopefully you are next going to be able to merge your PR!
   
   Thanks a lot @javanna! I'll push the new revision post #12574 is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on pull request #12183: Make TermStates#build concurrent

2023-09-20 Thread via GitHub


javanna commented on PR #12183:
URL: https://github.com/apache/lucene/pull/12183#issuecomment-1727419929

   Hey @shubhamvishu heads up: I merged #12659 to address the deadlock issue 
and opened #12574 to adjust TaskExecutor visibility outside of this PR. 
Hopefully you are next going to be able to merge your PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent closed issue #12570: Reading after Segment Merge fails for HNSW

2023-09-20 Thread via GitHub


benwtrent closed issue #12570: Reading after Segment Merge fails for HNSW
URL: https://github.com/apache/lucene/issues/12570


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna merged pull request #12574: Make TaskExecutor public

2023-09-20 Thread via GitHub


javanna merged PR #12574:
URL: https://github.com/apache/lucene/pull/12574


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz opened a new issue, #12572: Make IndexWriter#flushNextBuffer flush deletes too?

2023-09-20 Thread via GitHub


jpountz opened a new issue, #12572:
URL: https://github.com/apache/lucene/issues/12572

   ### Description
   
   `IndexWriter#flushNextBuffer()` is a convenient way to control indexing 
buffer sizes across multiple index writers. Unfortunately, it seems that it 
only ever flushes DWPTs, and never deletes, even though deletes may use a very 
significant amount of the indexing buffer.
   
   Can we enhance `flushNextBuffer()` to flush deletes too when appropriate?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna opened a new pull request, #12574: Make TaskExecutor public

2023-09-20 Thread via GitHub


javanna opened a new pull request, #12574:
URL: https://github.com/apache/lucene/pull/12574

   TaskExecutor is currently package private. We have scenarios where we want 
to parallelize the execution and reuse it outside of its package, hence this 
commit makes it public.
   
   Note that its constructor remains package private as it is supposed to be 
created by the index searcher, and later retrieved from it via the appropriate 
getter, which is also made public as part of this commit.
   
   This is a pre-requisite for #12183 .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih closed issue #9660: ArrayIndexOutOfBoundsException in ByteBlockPool [LUCENE-8614]

2023-09-20 Thread via GitHub


zhaih closed issue #9660: ArrayIndexOutOfBoundsException in ByteBlockPool 
[LUCENE-8614]
URL: https://github.com/apache/lucene/issues/9660


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] stefanvodita commented on issue #9660: ArrayIndexOutOfBoundsException in ByteBlockPool [LUCENE-8614]

2023-09-20 Thread via GitHub


stefanvodita commented on issue #9660:
URL: https://github.com/apache/lucene/issues/9660#issuecomment-1727196480

   Yes, it's resolved. Thanks, Patrick!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on pull request #12569: Prevent concurrent tasks from parallelizing further

2023-09-20 Thread via GitHub


javanna commented on PR #12569:
URL: https://github.com/apache/lucene/pull/12569#issuecomment-1727388144

   >  It might be worth using CallerRunsPolicy with a small queue in tests 
sometimes, as this is an interesting case that will make tasks run in the 
current thread.
   
   Given that TaskExecutor runs in the caller thread when the counter is above 
zero, I was thinking that we already cover the situation where we increment the 
counter for the caller thread. And that has no effect other than executing in 
the caller thread any further task.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] kaivalnp opened a new issue, #12575: Allow implementers of AbstractKnnVectorQuery to access final topK results?

2023-09-20 Thread via GitHub


kaivalnp opened a new issue, #12575:
URL: https://github.com/apache/lucene/issues/12575

   ### Context
   
   Vector search is performed in 
[`AbstractKnnVectorQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java),
 where individual HNSW searches are delegated to sub-classes via 
[`#approximateSearch`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L174).
 This is useful to implement custom functionality (like say pro-rating `k` 
across segments for performance, or requesting some additional results over `k` 
for higher recall with Exact KNN)
   
   While the class in itself is `package-private`, we extend the corresponding 
sub-classes for 
[byte](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnByteVectorQuery.java)
 and 
[float](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java)
 vectors (which are `public`) for implementing any custom functionality like 
above
   
   ### Issue
   
   After searching across all segments, we retain the index-level `topk` 
results 
[here](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L88),
 and immediately call 
[`#createRewrittenQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L219)
 to rewrite them as a 
[`DocAndScoreQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L297)
   
   So the implementing classes do not have access to the final `topK` results 
anywhere, which may be useful to read / modify like a post-process step (for 
example metric emission, or counting / only keeping results above a threshold)
   
   ### Proposal
   
   Can we make 
[`#createRewrittenQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L219)
 `protected` to allow sub-classes to override it (and ultimately access the 
`topK` results)?
   They can simply delegate to the original function at the end, or even 
implement a custom `Query` if required
   
   [Example 
changes](https://github.com/apache/lucene/commit/f560856626a50ddc5c24d4f3b4d8b8cc4bbfb228)
   
   I don't how common of a use-case this is, and wanted to get some opinions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on a diff in pull request #12574: Make TaskExecutor public

2023-09-20 Thread via GitHub


javanna commented on code in PR #12574:
URL: https://github.com/apache/lucene/pull/12574#discussion_r1331497358


##
lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java:
##
@@ -85,11 +85,24 @@ final  List invokeAll(Collection> tasks) 
throws IOException {
 return results;
   }
 
-  final  Task createTask(Callable callable) {
+  /**
+   * Creates a task given the provided {@link Callable}
+   *
+   * @param callable the callable to be executed as part of the task
+   * @return the created task
+   * @param  the return type of the task
+   */
+  public  Task createTask(Callable callable) {
 return new Task<>(callable);
   }
 
-  static class Task extends FutureTask {
+  /**
+   * Extension of {@link FutureTask} that tracks the number of tasks that are 
running in each
+   * thread.
+   *
+   * @param  the return tyupe of the task

Review Comment:
   Thanks @gf2121 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on a diff in pull request #12569: Prevent concurrent tasks from parallelizing further

2023-09-20 Thread via GitHub


javanna commented on code in PR #12569:
URL: https://github.com/apache/lucene/pull/12569#discussion_r1331259326


##
lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java:
##
@@ -22,18 +22,29 @@
 import java.util.Collection;
 import java.util.List;
 import java.util.Objects;
+import java.util.concurrent.Callable;
 import java.util.concurrent.ExecutionException;
 import java.util.concurrent.Executor;
 import java.util.concurrent.Future;
-import java.util.concurrent.RunnableFuture;
+import java.util.concurrent.FutureTask;
 import org.apache.lucene.util.IOUtils;
 import org.apache.lucene.util.ThreadInterruptedException;
 
 /**
  * Executor wrapper responsible for the execution of concurrent tasks. Used to 
parallelize search
- * across segments as well as query rewrite in some cases.
+ * across segments as well as query rewrite in some cases. Exposes a {@link 
#createTask(Callable)}
+ * method to create tasks given a {@link Callable}, as well as the {@link 
#invokeAll(Collection)}
+ * method to execute a set of tasks concurrently. Once all tasks are submitted 
to the executor, it
+ * blocks and wait for all tasks to be completed, and then returns a list with 
the obtained results.
+ * Ensures that the underlying executor is only used for top-level {@link 
#invokeAll(Collection)}
+ * calls, and not for potential {@link #invokeAll(Collection)} calls made from 
one of the tasks.
+ * This is to prevent deadlock with certain types of executors, as well as to 
limit the level of
+ * parallelism.
  */
 class TaskExecutor {
+  // a static thread local is ok as long as there is a single TaskExecutor 
ever created

Review Comment:
   Thanks all for the feedback, I added a test that uses two index searcher 
instances with a shared executor. This would have failed with the previous 
boolean approach, and succeeds with the int counter.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on a diff in pull request #12569: Prevent concurrent tasks from parallelizing further

2023-09-20 Thread via GitHub


javanna commented on code in PR #12569:
URL: https://github.com/apache/lucene/pull/12569#discussion_r1331266192


##
lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java:
##
@@ -22,18 +22,28 @@
 import java.util.Collection;
 import java.util.List;
 import java.util.Objects;
+import java.util.concurrent.Callable;
 import java.util.concurrent.ExecutionException;
 import java.util.concurrent.Executor;
 import java.util.concurrent.Future;
-import java.util.concurrent.RunnableFuture;
+import java.util.concurrent.FutureTask;
 import org.apache.lucene.util.IOUtils;
 import org.apache.lucene.util.ThreadInterruptedException;
 
 /**
  * Executor wrapper responsible for the execution of concurrent tasks. Used to 
parallelize search
- * across segments as well as query rewrite in some cases.
+ * across segments as well as query rewrite in some cases. Exposes a {@link 
#createTask(Callable)}
+ * method to create tasks given a {@link Callable}, as well as the {@link 
#invokeAll(Collection)}
+ * method to execute a set of tasks concurrently. Once all tasks are submitted 
to the executor, it
+ * blocks and wait for all tasks to be completed, and then returns a list with 
the obtained results.
+ * Ensures that the underlying executor is only used for top-level {@link 
#invokeAll(Collection)}
+ * calls, and not for potential {@link #invokeAll(Collection)} calls made from 
one of the tasks.
+ * This is to prevent deadlock with certain types of executors, as well as to 
limit the level of
+ * parallelism.
  */
 class TaskExecutor {

Review Comment:
   It's all package private so far. We can change visibility to public (for the 
class and getter, but not the constructor) as a follow-up



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on issue #9660: ArrayIndexOutOfBoundsException in ByteBlockPool [LUCENE-8614]

2023-09-20 Thread via GitHub


zhaih commented on issue #9660:
URL: https://github.com/apache/lucene/issues/9660#issuecomment-1727052393

   @stefanvodita Seems the issue is resolved? I closed the issue, feel free to 
reopen it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna merged pull request #12569: Prevent concurrent tasks from parallelizing further

2023-09-20 Thread via GitHub


javanna merged PR #12569:
URL: https://github.com/apache/lucene/pull/12569


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] vsop-479 commented on pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.

2023-09-20 Thread via GitHub


vsop-479 commented on PR #12528:
URL: https://github.com/apache/lucene/pull/12528#issuecomment-1727215669

   @jpountz Please take a look when you get a chance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] shubhamvishu commented on pull request #12183: Make TermStates#build concurrent

2023-09-20 Thread via GitHub


shubhamvishu commented on PR #12183:
URL: https://github.com/apache/lucene/pull/12183#issuecomment-1727782959

   I have rebased the PR based on the changes in #12574. Could some please take 
a look? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on pull request #12569: Prevent concurrent tasks from parallelizing further

2023-09-20 Thread via GitHub


javanna commented on PR #12569:
URL: https://github.com/apache/lucene/pull/12569#issuecomment-172724

   I pushed new commits to address the latest review comments, thanks for all 
the input. This should be ready now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] tylerbertrand opened a new pull request, #12577: Resolve CompileJava task cache miss

2023-09-20 Thread via GitHub


tylerbertrand opened a new pull request, #12577:
URL: https://github.com/apache/lucene/pull/12577

   ### Description
   
   Resolves `CompileJava` cache miss caused by `options.compilerArgs` input 
difference.
   
   Moved the `apijar` input file to a `CommandLineArgumentProvider` to apply 
relative path sensitivity, so that changes in the absolute path of the file do 
not cause a cache miss.
   
   [Task input comparison without 
fixes](https://ge.solutions-team.gradle.com/c/ndkin5oemsjac/wfrzfdtcomsj6/task-inputs?cacheability=cacheable,overlapping-outputs,validation-failure)
 shows `options.compilerArgs` input differences causing the `CompileJava` tasks 
to be needlessly re-executed.
   
   [Task input comparison with cache fixes 
applied](https://ge.solutions-team.gradle.com/c/uwmxltj3pshhe/alpaezxmm5isc/task-inputs?cacheability=cacheable,overlapping-outputs,validation-failure)
 shows no input differences.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna opened a new pull request, #12578: Deprecate IndexSearcher#getExecutor

2023-09-20 Thread via GitHub


javanna opened a new pull request, #12578:
URL: https://github.com/apache/lucene/pull/12578

   We have recently introduced a TaskExecutor abstraction, which is meant to be 
used to execute concurrent tasks using the executor provided to the 
IndexSearcher constructor. All concurrenct tasks should be executed using the 
TaskExecutor, and there shoul be no reason to retrieve the Executor directly 
from the IndexSearcher.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on a diff in pull request #12569: Prevent concurrent tasks from parallelizing further

2023-09-20 Thread via GitHub


uschindler commented on code in PR #12569:
URL: https://github.com/apache/lucene/pull/12569#discussion_r1331117098


##
lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java:
##
@@ -22,18 +22,29 @@
 import java.util.Collection;
 import java.util.List;
 import java.util.Objects;
+import java.util.concurrent.Callable;
 import java.util.concurrent.ExecutionException;
 import java.util.concurrent.Executor;
 import java.util.concurrent.Future;
-import java.util.concurrent.RunnableFuture;
+import java.util.concurrent.FutureTask;
 import org.apache.lucene.util.IOUtils;
 import org.apache.lucene.util.ThreadInterruptedException;
 
 /**
  * Executor wrapper responsible for the execution of concurrent tasks. Used to 
parallelize search
- * across segments as well as query rewrite in some cases.
+ * across segments as well as query rewrite in some cases. Exposes a {@link 
#createTask(Callable)}
+ * method to create tasks given a {@link Callable}, as well as the {@link 
#invokeAll(Collection)}
+ * method to execute a set of tasks concurrently. Once all tasks are submitted 
to the executor, it
+ * blocks and wait for all tasks to be completed, and then returns a list with 
the obtained results.
+ * Ensures that the underlying executor is only used for top-level {@link 
#invokeAll(Collection)}
+ * calls, and not for potential {@link #invokeAll(Collection)} calls made from 
one of the tasks.
+ * This is to prevent deadlock with certain types of executors, as well as to 
limit the level of
+ * parallelism.
  */
 class TaskExecutor {
+  // a static thread local is ok as long as there is a single TaskExecutor 
ever created

Review Comment:
   Yes, static is fine.
   You only have to be careful with multiple index searchers all using a single 
ThreadPool instance (e.g., passed to many shards, each having their own 
IndexSearcher). So multiple IndexSearchers would share same thread pool, 
although all have a different instance of TaskExecutor.
   To workaround possible inconsistencies, I suggest to use (it was already 
implemented by @javanna) to increment/decrement instead of a binary switch, 
because it could happen that a task could execute another IndexSearcher method 
(e.g. also from another seracher of a different index, but using the same 
thread pool, e.g. to execute a paralelizzed join between indexes). If we only 
have a boolean instead of an integer, the state of the threadlocal after 
returning from the inner call would be false (instead true), so another call to 
s subjob would then parallize.
   By using a counter (as suggested above and already implemented) we are safe 
for such usages. Each task increments counter and decrements when done.
   In addition a counter would help us easily to implement parallelism by 
allowing the counter to go >1 before we switch to serial.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on a diff in pull request #12569: Prevent concurrent tasks from parallelizing further

2023-09-20 Thread via GitHub


javanna commented on code in PR #12569:
URL: https://github.com/apache/lucene/pull/12569#discussion_r1331266192


##
lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java:
##
@@ -22,18 +22,28 @@
 import java.util.Collection;
 import java.util.List;
 import java.util.Objects;
+import java.util.concurrent.Callable;
 import java.util.concurrent.ExecutionException;
 import java.util.concurrent.Executor;
 import java.util.concurrent.Future;
-import java.util.concurrent.RunnableFuture;
+import java.util.concurrent.FutureTask;
 import org.apache.lucene.util.IOUtils;
 import org.apache.lucene.util.ThreadInterruptedException;
 
 /**
  * Executor wrapper responsible for the execution of concurrent tasks. Used to 
parallelize search
- * across segments as well as query rewrite in some cases.
+ * across segments as well as query rewrite in some cases. Exposes a {@link 
#createTask(Callable)}
+ * method to create tasks given a {@link Callable}, as well as the {@link 
#invokeAll(Collection)}
+ * method to execute a set of tasks concurrently. Once all tasks are submitted 
to the executor, it
+ * blocks and wait for all tasks to be completed, and then returns a list with 
the obtained results.
+ * Ensures that the underlying executor is only used for top-level {@link 
#invokeAll(Collection)}
+ * calls, and not for potential {@link #invokeAll(Collection)} calls made from 
one of the tasks.
+ * This is to prevent deadlock with certain types of executors, as well as to 
limit the level of
+ * parallelism.
  */
 class TaskExecutor {

Review Comment:
   It's all package private so far. We can change visibility to public (for the 
class and getter, but not the constructor) when needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] kaivalnp opened a new issue, #12579: [DISCUSS] Should there be a threshold-based vector search API?

2023-09-20 Thread via GitHub


kaivalnp opened a new issue, #12579:
URL: https://github.com/apache/lucene/issues/12579

   ### Context
   
   Almost all [vector search 
algorithms](https://ann-benchmarks.com/index.html#algorithms) focus on getting 
the `topK` results for a given query vector. This however, may not be the best 
objective to pursue because:
   
   - Some queries are more suited for vector search: Indexed vectors may not be 
uniformly distributed, with some parts of the vector space denser than others. 
If the query is present in a sparse part of the space, the `topK` results will 
be low scoring (and not relevant to the query). We may want some kind of score 
`threshold`, below which a vector is not considered as a "match"?
   - Conversely, if the query is present in a dense part of the vector space, 
we may not want to limit vector search to `topK` when we have some more 
(slightly lower scoring) results, but still above the `threshold`. Considering 
a query and a vector as a "match" if their similarity score is above the 
`threshold`, we may want ALL possible matches?
   
   I found the same thing being discussed in #11946 as well.. However, I'm not 
aware of any algorithms designed for this objective (or how common of a 
use-case this may be)
   
   ### Proposal
   
   **IF** this turns out to be a valid use-case, HNSW may not be the best 
algorithm for this new objective (as it was designed for getting the K-Nearest 
Neighbors). On a high-level, the current algorithm is:
   - Maintain a queue of `topK` results (initially empty) and an unbounded 
queue of candidates (initially containing the entry node). Both of these are 
priority queues in descending order of similarity scores
   - Starting from the best (highest scoring) candidate, we insert it into the 
result queue if there are (1) fewer than `topK` results (2) it is better 
scoring than any existing result (also removing the least scoring result in 
this case)
   - If a node is added as a result, consider all its neighbors as further 
candidates
   - The search stops when the best scoring candidate is worse than the least 
scoring result
   
   For the new objective, the underlying graphs created by HNSW may still be 
useful - where each node is connected to its nearest (and diverse) neighbors, 
so we have some sense of direction / ordering on which node to explore further
   
   Can we instead change the graph traversal (and result collection) criteria 
such that:
   - Consider a node as a candidate if it exceeds a lower, graph-traversal 
specific score threshold
   - Consider a candidate as a result if it exceeds the actual, query-result 
similarity score threshold
   
   The lower graph-traversal threshold acts as a buffer, to retain results 
where all nodes along its path may not be above the actual `threshold` (and is 
treated as a search-time hyper-parameter, chosen according to the underlying 
vector space and queries)
   
   The upper levels of search will remain the same (find the single-best entry 
point for the next layer) and this new criteria is only applicable to the 
lowest layer (with all nodes) to collect an unbounded number of results above 
the `threshold`
   
   This will also eliminate the need to maintain results and candidates as 
priority queues, instead keeping them as simple lists (reducing the overhead of 
`heapify` calls that maintain the min-max heap property) and just sort them 
once at the end
   
   **Looking for inputs on both the use-case and tweaked algorithm..**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on a diff in pull request #12577: Resolve CompileJava task cache miss

2023-09-20 Thread via GitHub


dweiss commented on code in PR #12577:
URL: https://github.com/apache/lucene/pull/12577#discussion_r1331954883


##
gradle/java/core-mrjar.gradle:
##
@@ -29,20 +29,19 @@ configure(project(":lucene:core")) {
   dependencies.add("main${jdkVersion}Implementation", 
sourceSets.main.output)
 
   tasks.named("compileMain${jdkVersion}Java").configure {
-def apijar = new File(apijars, "jdk${jdkVersion}.apijar")
-
-inputs.file(apijar)
+def apijar = 
layout.projectDirectory.file("src/generated/jdk/jdk${jdkVersion}.apijar")

Review Comment:
   this was previously relative to 'apijars' variable; now it's fixed under 
src/generated... - is this intentional?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on pull request #12578: Deprecate IndexSearcher#getExecutor

2023-09-20 Thread via GitHub


javanna commented on PR #12578:
URL: https://github.com/apache/lucene/pull/12578#issuecomment-1728197364

   > I assume you will remove the deprecated method in main branch and add an 
entry to the MIGRATE.txt there?
   
   Yes. but I'll open a PR against main to have people double check the change. 
You'll get summoned there too ;)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] tylerbertrand commented on a diff in pull request #12577: Resolve CompileJava task cache miss

2023-09-20 Thread via GitHub


tylerbertrand commented on code in PR #12577:
URL: https://github.com/apache/lucene/pull/12577#discussion_r1332210752


##
gradle/java/core-mrjar.gradle:
##
@@ -29,20 +29,19 @@ configure(project(":lucene:core")) {
   dependencies.add("main${jdkVersion}Implementation", 
sourceSets.main.output)
 
   tasks.named("compileMain${jdkVersion}Java").configure {
-def apijar = new File(apijars, "jdk${jdkVersion}.apijar")
-
-inputs.file(apijar)
+def apijar = 
layout.projectDirectory.file("src/generated/jdk/jdk${jdkVersion}.apijar")

Review Comment:
   Sort of - the lazy file API (`RegularFile` in this case) doesn't mix well 
with the normal, "eager" `File` type. Since other things use `apijars`, I 
didn't want to have these changes cascade elsewhere in the project by updating 
that to use the lazy file API. Turns out that was a simple update to make, so 
I've just pushed a commit making that update, and now the path of `apijar` is 
relative to the `apijars` variable once again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] tylerbertrand commented on a diff in pull request #12577: Resolve CompileJava task cache miss

2023-09-20 Thread via GitHub


tylerbertrand commented on code in PR #12577:
URL: https://github.com/apache/lucene/pull/12577#discussion_r1332210752


##
gradle/java/core-mrjar.gradle:
##
@@ -29,20 +29,19 @@ configure(project(":lucene:core")) {
   dependencies.add("main${jdkVersion}Implementation", 
sourceSets.main.output)
 
   tasks.named("compileMain${jdkVersion}Java").configure {
-def apijar = new File(apijars, "jdk${jdkVersion}.apijar")
-
-inputs.file(apijar)
+def apijar = 
layout.projectDirectory.file("src/generated/jdk/jdk${jdkVersion}.apijar")

Review Comment:
   Sort of - Gradle's lazy file API (`RegularFile` class in this case) doesn't 
mix well with the normal, "eager" `File` type. Since other things use 
`apijars`, I didn't want to have these changes cascade elsewhere in the project 
by updating that to use the lazy file API. Turns out that was a simple update 
to make, so I've just pushed a commit making that update, and now the path of 
`apijar` is relative to the `apijars` variable once again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org