Re: [PR] Terminate automaton when it can match all suffixes, and match suffixes directly. [lucene]

via GitHub Tue, 21 Apr 2026 02:17:32 -0700


vsop-479 commented on code in PR #13072:
URL: https://github.com/apache/lucene/pull/13072#discussion_r3116325437



##########
lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java:
##########
@@ -96,6 +101,35 @@ protected RunAutomaton(Automaton a, int alphabetSize) {
     }
   }
 
+  /** Detect whether this state can accept everything(all remaining suffixes). */
+  private boolean detectMatchAllSuffix(int state) {
+    assert automaton.isAccept(state);
+    Transition transition = new Transition();
+    int numTransitions = automaton.getNumTransitions(state);
+    // Apply to PrefixQuery, TermRangeQuery, custom binary Automata.
+    if (numTransitions == 1) {
+      automaton.getTransition(state, 0, transition);
+      if (transition.dest == state && transition.min == 0 && transition.max == alphabetSize - 1) {
+        return true;
+      }
+    }
+
+    // Apply to RegexpQuery, WildcardQuery.
+    // TODO: Is it enough just check last transition is [0, 127]?.
+    for (int i = 0; i < numTransitions; i++) {
+      automaton.getTransition(state, i, transition);
+      if (transition.min == 0 && transition.max == 127) {
+        if (transition.dest == state) {
+          return true;
+        } else if (automaton.isAccept(transition.dest)) {
+          // recurse
+          return detectMatchAllSuffix(transition.dest);
+        }
+      }
+    }
+    return false;

Review Comment:
   That is incorrect. After UTF32ToUTF8 conversion, transitions intentionally
have gaps for illegal UTF-8 byte sequences (e.g. [128, 193] and the surrogate
range). These bytes never appear in indexed terms, so rejecting a state because
of such a gap would wrongly disqualify valid matchAllSuffix states.
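
   To illustrate the gap, here is a minimal JDK-only sketch. It is not Lucene
code, and the class name `LeadByteGaps` and its helper method are made up for
this example. It encodes every non-surrogate code point to UTF-8 and records
which byte values ever occur as the first byte of a character. It says nothing
about how UTF32ToUTF8 is implemented; it only shows which lead bytes well-formed
UTF-8 can contain.

```java
import java.nio.charset.StandardCharsets;

public class LeadByteGaps {

  /** Returns a 256-entry table; entry b is true if byte b starts some encoded code point. */
  static boolean[] leadBytesSeen() {
    boolean[] seen = new boolean[256];
    for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
      // Surrogates are not encodable on their own in well-formed UTF-8, so skip them.
      if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
        continue;
      }
      byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
      seen[utf8[0] & 0xFF] = true;
    }
    return seen;
  }

  public static void main(String[] args) {
    boolean[] seen = leadBytesSeen();
    int gapStart = -1;
    // Walk one past the end so a gap that reaches 255 is still reported.
    for (int b = 0; b <= 256; b++) {
      boolean missing = b < 256 && !seen[b];
      if (missing && gapStart < 0) {
        gapStart = b;
      } else if (!missing && gapStart >= 0) {
        System.out.println("never a lead byte: [" + gapStart + ", " + (b - 1) + "]");
        gapStart = -1;
      }
    }
  }
}
```

   Running `main` reports [128, 193] and [245, 255] as ranges that never start
a character, which is why a transition set can cover every byte that can occur
in an indexed term without covering [0, 255] contiguously.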



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]