Re: [PR] Limit term dictionary traversal in multi-term queries [lucene]

via GitHub Wed, 17 Jun 2026 01:23:16 -0700


iprithv commented on code in PR #16240:
URL: https://github.com/apache/lucene/pull/16240#discussion_r3426627958



##########
lucene/core/src/java/org/apache/lucene/index/FilteredTermsEnum.java:
##########
@@ -228,19 +249,20 @@ public BytesRef next() throws IOException {
       if (doSeek) {
         doSeek = false;
         final BytesRef t = nextSeekTerm(actualTerm);
-        // System.out.println("  seek to t=" + (t == null ? "null" : 
t.utf8ToString()) + " tenum=" +
-        // tenum);
         // Make sure we always seek forward:
         assert actualTerm == null || t == null || t.compareTo(actualTerm) > 0
             : "curTerm=" + actualTerm + " seekTerm=" + t;
-        if (t == null || tenum.seekCeil(t) == SeekStatus.END) {
-          // no more terms to seek to or enum exhausted
-          // System.out.println("  return null");
+        if (t == null) {
+          return null;
+        }
+        if (--visitsBudget < 0 || tenum.seekCeil(t) == SeekStatus.END) {

Review Comment:
   budget gets decremented before checking seekCeil/next, so if the enum 
naturally runs out of terms and visitsBudget happens to land on 0 at the same 
time, isVisitsBudgetExhausted() returns true even though you got all the 
terms...caller then throws away complete results and falls back to deferred 
collection for no reason
   can we have separate boolean budgetExhausted flag that only gets set when 
the budget check actually fires?



##########
lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java:
##########
@@ -228,62 +237,66 @@ public ScorerSupplier scorerSupplier(LeafReaderContext 
context) throws IOExcepti
       final long cost;
       final IOLongFunction<WeightOrDocIdSetIterator> weightOrIteratorSupplier;
 
-      // Only collect terms while building the ScorerSupplier when the query 
exposes a known,
-      // bounded term count (e.g. TermInSetQuery, getTermsCount() >= 0). 
There, collecting is
-      // cheap and lets us return a null supplier up-front so a parent 
BooleanQuery can
-      // short-circuit.
-      //
-      // For queries with an unknown term count (e.g. automaton queries: 
wildcard / regexp /
-      // prefix / range), collecting eagerly can scan the whole term 
dictionary during
-      // ScorerSupplier construction -- a leading wildcard such as "*foo*" 
cannot seek and must
-      // visit every term. That is supposed to be the cheap "planning" phase, 
and doing it there
-      // defeats a parent conjunction's ability to short-circuit (a sibling 
clause matching no
-      // documents can no longer skip this clause before the scan runs). So 
for an unknown term
-      // count we estimate the cost and defer term collection to 
ScorerSupplier#get().
-      if (q.getTermsCount() >= 0) {
-        List<TermAndState> collectedTerms = new ArrayList<>();
-        boolean collectResult = collectTerms(fieldDocCount, termsEnum, 
collectedTerms);
-        if (collectResult) {
-          // Return a null supplier if no query terms were in the segment:
-          if (collectedTerms.isEmpty()) {
-            return null;
-          }
+      // Try to eagerly collect matching terms. For queries with a known term 
count
+      // (e.g. TermInSetQuery), we always collect eagerly. For queries with an 
unknown term
+      // count (e.g. automaton queries: wildcard / regexp / prefix / range), 
we attempt a
+      // budgeted probe: if the automaton can find all matching terms within a 
small number of
+      // underlying TermsEnum operations, we use those results. Otherwise 
(probe exhausts its
+      // budget, or no probe is possible), we estimate the cost and defer term 
collection to
+      // ScorerSupplier#get() -- eagerly scanning the whole term dictionary 
during the
+      // "planning" phase would defeat a parent conjunction's ability to 
short-circuit.
+      List<TermAndState> eagerTerms = new ArrayList<>();
+      TermsEnum deferredTermsEnum = termsEnum;
+      boolean eagerSuccess;
 
-          // TODO: Instead of replicating the cost logic of a BooleanQuery we 
could consider
-          // rewriting to a BQ eagerly at this point and delegating to its 
cost method (instead of
-          // lazily rewriting on #get). Not sure what the performance hit 
would be of doing this
-          // though.
-          long sumTermCost = 0;
-          for (TermAndState collectedTerm : collectedTerms) {
-            sumTermCost += collectedTerm.docFreq;
-          }
-          cost = sumTermCost;
+      if (q.getTermsCount() >= 0) {
+        eagerSuccess = collectTerms(fieldDocCount, termsEnum, eagerTerms);
+      } else {
+        // Unknown term count. Try a cheap budgeted probe: if the automaton 
can find
+        // all matching terms within a small number of underlying TermsEnum 
operations,
+        // use those results eagerly. Otherwise, fall back to deferred 
collection.
+        FilteredTermsEnum probeEnum = null;
+        if (termsEnum instanceof FilteredTermsEnum fte) {
+          probeEnum = fte;
+        } else if (q instanceof AutomatonQuery aq
+            && aq.getCompiled().type == 
CompiledAutomaton.AUTOMATON_TYPE.NORMAL) {
+          probeEnum = new AutomatonTermsEnum(terms.iterator(), 
aq.getCompiled());
+        }
+        if (probeEnum != null) {
+          probeEnum.setVisitsBudget(AUTOMATON_TERM_COLLECT_VISIT_BUDGET);
+          boolean probeResult = collectTerms(fieldDocCount, probeEnum, 
eagerTerms);
+          eagerSuccess = probeResult && !probeEnum.isVisitsBudgetExhausted();
         } else {
-          cost = estimateCost(terms, q.getTermsCount());
+          eagerSuccess = false;
         }
-        weightOrIteratorSupplier =
-            leadCost -> {
-              if (collectResult) {
-                return rewriteAsBooleanQuery(context, collectedTerms);
-              } else {
-                // Too many terms to rewrite as a simple bq.
-                // Invoke rewriteInner logic to handle rewriting:
-                return rewriteInner(
-                    context, fieldDocCount, terms, termsEnum, collectedTerms, 
leadCost);
-              }
-            };
+        if (!eagerSuccess) {
+          deferredTermsEnum = (probeEnum == termsEnum) ? q.getTermsEnum(terms) 
: termsEnum;
+          eagerTerms = new ArrayList<>();

Review Comment:
   when the probe runs out of budget, this throws away whatever terms it 
already found (eagerTerms = new ArrayList<>())... those are valid matches :) 
since collectTerms appends to the list, we could keep the partial results and 
let the deferred pass continue from where the probe stopped, instead of redoing 
that work?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Limit term dictionary traversal in multi-term queries [lucene]

Reply via email to