mikemccand commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1236910632
########## lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##########
@@ -86,19 +92,58 @@ public TermStates(
    * @param needsStats if {@code true} then all leaf contexts will be visited up-front to collect
    *     term statistics. Otherwise, the {@link TermState} objects will be built only when requested
    */
-  public static TermStates build(IndexReaderContext context, Term term, boolean needsStats)
+  public static TermStates build(
+      IndexSearcher indexSearcher, IndexReaderContext context, Term term, boolean needsStats)
       throws IOException {
     assert context != null && context.isTopLevel;
     final TermStates perReaderTermState = new TermStates(needsStats ? null : term, context);
     if (needsStats) {
-      for (final LeafReaderContext ctx : context.leaves()) {
-        // if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-        TermsEnum termsEnum = loadTermsEnum(ctx, term);
-        if (termsEnum != null) {
-          final TermState termState = termsEnum.termState();
-          // if (DEBUG) System.out.println("    found");
-          perReaderTermState.register(
-              termState, ctx.ord, termsEnum.docFreq(), termsEnum.totalTermFreq());
+      Executor executor = indexSearcher.getExecutor();
+      boolean isShutdown = false;
+      if (executor instanceof ExecutorService) {
+        isShutdown = ((ExecutorService) executor).isShutdown();
+      }
+      if (executor != null && isShutdown == false) {
+        // build term states concurrently
+        List<FutureTask<Integer>> tasks =
+            context.leaves().stream()
+                .map(
+                    ctx ->
+                        new FutureTask<>(
+                            () -> {
+                              TermsEnum termsEnum = loadTermsEnum(ctx, term);
+                              if (termsEnum != null) {
+                                final TermState termState = termsEnum.termState();
+                                perReaderTermState.register(
+                                    termState,
+                                    ctx.ord,
+                                    termsEnum.docFreq(),
+                                    termsEnum.totalTermFreq());
+                              }
+                              return 0;
+                            }))
+                .toList();
+        for (FutureTask<Integer> task : tasks) {
+          executor.execute(task);

Review Comment:
> It would make more sense to me to parallelize at the segment level than at the slice level indeed since vector search has a much higher per-segment cost
> than per-document cost. I'm curious what others think about this, e.g. @mikemccand? To me slicing is really about things that have a per-document cost, i.e. running scorers / collectors.

Oh that's a great point @jpountz -- the cost of a terms dict lookup is ~`O(log(max_doc))` I suspect, versus the `O(max_doc)` normal cost of visiting all matching hits. So it makes sense to me to ignore the slicing and just do "job per segment" as you propose.

And I think a similar argument applies to HNSW, though I understand its cost curve as `max_doc` grows less well ... maybe it's even flatter / faster-saturating than `log`? So, yeah, +1 for segments, not slices, here too.

I suppose it's likely that this is the typical cost profile of `rewrite`? Do we have other examples? `FuzzyQuery` is also visiting the terms dict ...

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
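For illustration, here is a minimal, standalone sketch of the "one `FutureTask` per segment, then wait" pattern the diff is building toward. Everything here is hypothetical scaffolding (the class name, `lookup`, the pool size, and the segment count are made up, not Lucene API); `lookup` merely stands in for the per-leaf terms-dict work, and the blocking `get()` loop is what the truncated diff presumably does after submitting the tasks:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.stream.IntStream;

public class PerSegmentTasks {

  // Hypothetical per-segment work: a stand-in for the loadTermsEnum +
  // register(...) done per leaf in the diff above.
  static int lookup(int segmentOrd) {
    return segmentOrd * 2;
  }

  // One task per segment, submitted to an executor, then awaited so all
  // per-segment results are in hand before the method returns.
  static int sumPerSegment(int numSegments) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(4);
    try {
      // Mirror the diff: build one FutureTask per segment ordinal.
      List<FutureTask<Integer>> tasks =
          IntStream.range(0, numSegments)
              .mapToObj(ord -> new FutureTask<Integer>(() -> lookup(ord)))
              .toList();
      for (FutureTask<Integer> task : tasks) {
        executor.execute(task);
      }
      // The diff is truncated before this point; presumably the caller then
      // blocks on each task so every segment is registered before returning.
      int total = 0;
      for (FutureTask<Integer> task : tasks) {
        total += task.get();
      }
      return total;
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(sumPerSegment(8)); // prints 56
  }
}
```

Note the design point under discussion: the tasks are created directly from `context.leaves()` (one per segment), not from the searcher's slices, which matches the "segments, not slices" argument since a terms-dict lookup has per-segment rather than per-document cost.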