Re: [PR] Terminate automaton when it can match all suffixes, and match suffixes directly. [lucene]

via GitHub Fri, 17 Apr 2026 02:05:04 -0700


Copilot commented on code in PR #13072:
URL: https://github.com/apache/lucene/pull/13072#discussion_r3099052173



##########
lucene/core/src/java/org/apache/lucene/util/automaton/ByteRunnable.java:
##########
@@ -37,6 +37,14 @@ public interface ByteRunnable {
    */
   boolean isAccept(int state);
 
+  /**
+   * Returns true if this state can accept all remaining suffixes from now on.
+   *
+   * @param state the state
+   * @return whether this state can accept all remaining suffixes.
+   */
+  boolean isMatchAllSuffix(int state);

Review Comment:
   `ByteRunnable` is a public interface, and adding a new abstract method 
(`isMatchAllSuffix`) is a binary/source breaking change for any external 
implementations. To preserve compatibility, make this a `default` method (e.g., 
returning `false`) so existing implementors continue to work without 
recompilation, while allowing optimized implementations to override it.
   ```suggestion
     default boolean isMatchAllSuffix(int state) {
       return false;
     }
   ```



##########
lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java:
##########
@@ -96,6 +101,35 @@ protected RunAutomaton(Automaton a, int alphabetSize) {
     }
   }
 
+  /** Detect whether this state can accept everything(all remaining suffixes). 
*/
+  private boolean detectMatchAllSuffix(int state) {
+    assert automaton.isAccept(state);
+    Transition transition = new Transition();
+    int numTransitions = automaton.getNumTransitions(state);
+    // Apply to PrefixQuery, TermRangeQuery, custom binary Automata.
+    if (numTransitions == 1) {
+      automaton.getTransition(state, 0, transition);
+      if (transition.dest == state && transition.min == 0 && transition.max == 
alphabetSize - 1) {
+        return true;
+      }
+    }
+
+    // Apply to RegexpQuery, WildcardQuery.
+    // TODO: Is it enough just check last transition is [0, 127]?.
+    for (int i = 0; i < numTransitions; i++) {
+      automaton.getTransition(state, i, transition);
+      if (transition.min == 0 && transition.max == 127) {
+        if (transition.dest == state) {
+          return true;
+        } else if (automaton.isAccept(transition.dest)) {
+          // recurse
+          return detectMatchAllSuffix(transition.dest);
+        }
+      }
+    }
+    return false;

Review Comment:
   `detectMatchAllSuffix` currently hard-codes an ASCII range check 
(`transition.max == 127`) and returns true as soon as it finds a `[0,127]` 
transition that loops (or reaches another accept state). This is not sufficient 
to prove the state accepts *all* remaining suffixes for a DFA over 
`alphabetSize` symbols, and can incorrectly mark states as `matchAllSuffix`, 
causing `IntersectTermsEnum` to return terms that do not actually match the 
automaton. Consider computing `matchAllSuffix` as a fixpoint: a state qualifies 
iff it is accept, its outgoing transitions cover the full alphabet, and every 
destination is also `matchAllSuffix` (with cycle detection / memoization to 
avoid infinite recursion on cyclic graphs).
   ```suggestion
       byte[] memo = new byte[size];
       return detectMatchAllSuffix(state, memo);
     }
   
     /**
      * Memoized fixpoint computation for {@code matchAllSuffix}.
      *
      * <p>A state matches all remaining suffixes iff it is accept, its 
outgoing transitions cover the
      * full alphabet with no gaps, and every destination state also matches 
all remaining suffixes.
      *
      * <p>Memo values: 0 = unknown, 1 = true, 2 = false, 3 = visiting.
      */
     private boolean detectMatchAllSuffix(int state, byte[] memo) {
       if (memo[state] == 1) {
         return true;
       } else if (memo[state] == 2) {
         return false;
       } else if (memo[state] == 3) {
         // We reached a cycle. Treat it as provisionally true and let the 
caller finish validating
         // full alphabet coverage for all states in the cycle.
         return true;
       }
   
       if (automaton.isAccept(state) == false) {
         memo[state] = 2;
         return false;
       }
   
       memo[state] = 3;
   
       Transition transition = new Transition();
       int numTransitions = automaton.getNumTransitions(state);
       int expectedMin = 0;
       for (int i = 0; i < numTransitions; i++) {
         automaton.getTransition(state, i, transition);
         if (transition.min != expectedMin || 
detectMatchAllSuffix(transition.dest, memo) == false) {
           memo[state] = 2;
           return false;
         }
         expectedMin = transition.max + 1;
       }
   
       if (expectedMin != alphabetSize) {
         memo[state] = 2;
         return false;
       }
   
       memo[state] = 1;
       return true;
   ```



##########
lucene/core/src/test/org/apache/lucene/search/TestAutomatonQuery.java:
##########
@@ -16,21 +16,19 @@
  */
 package org.apache.lucene.search;
 
+import static org.apache.lucene.tests.util.TestUtil.alwaysPostingsFormat;
+import static org.apache.lucene.tests.util.TestUtil.getDefaultPostingsFormat;
 import static 
org.apache.lucene.util.automaton.Operations.DEFAULT_DETERMINIZE_WORK_LIMIT;
 
 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.Collections;
 import java.util.List;
 import java.util.concurrent.CountDownLatch;
+import org.apache.lucene.codecs.PostingsFormat;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.MultiTerms;
-import org.apache.lucene.index.SingleTermsEnum;
-import org.apache.lucene.index.Term;
-import org.apache.lucene.index.Terms;
-import org.apache.lucene.index.TermsEnum;
+import org.apache.lucene.index.*;

Review Comment:
   `import org.apache.lucene.index.*;` reduces clarity of dependencies in this 
test and is inconsistent with most other tests in `org.apache.lucene.search` 
which list explicit Index imports. Please switch back to explicit imports for 
the specific types used in this file.
   ```suggestion
   import org.apache.lucene.index.DirectoryReader;
   import org.apache.lucene.index.IndexReader;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.Term;
   ```



##########
lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java:
##########
@@ -105,6 +139,8 @@ public String toString() {
       b.append("state ").append(i);
       if (accept.get(i)) b.append(" [accept]:\n");
       else b.append(" [reject]:\n");
+      if (matchAllSuffix.get(i)) b.append(" [matchAllSuffix]:\n");
+      else b.append(" [can not matchAllSuffix]:\n");

Review Comment:
   The new `toString()` output uses the phrase "can not matchAllSuffix". If 
this string is intended for human readability/debugging, "cannot" is the 
standard spelling (or consider omitting the negative case entirely to keep 
output compact).
   ```suggestion
         else b.append(" [cannot matchAllSuffix]:\n");
   ```



##########
lucene/core/src/test/org/apache/lucene/search/TestPrefixQuery.java:
##########
@@ -16,15 +16,18 @@
  */
 package org.apache.lucene.search;
 
+import static org.apache.lucene.tests.util.TestUtil.alwaysPostingsFormat;
+import static org.apache.lucene.tests.util.TestUtil.getDefaultPostingsFormat;
+
 import java.util.ArrayList;
 import java.util.Collections;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Set;
+import org.apache.lucene.codecs.PostingsFormat;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.Term;
+import org.apache.lucene.index.*;

Review Comment:
   `import org.apache.lucene.index.*;` is the only wildcard import in this test 
package and makes it harder to see which Index classes are actually used. 
Prefer explicit imports here for consistency/readability (e.g., as done in 
`TestWildcardQuery` and `TestRegexpQuery`).
   ```suggestion
   import org.apache.lucene.index.IndexReader;
   import org.apache.lucene.index.Term;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Terminate automaton when it can match all suffixes, and match suffixes directly. [lucene]

Reply via email to