vsop-479 commented on code in PR #13072:
URL: https://github.com/apache/lucene/pull/13072#discussion_r3116325437
##########
lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java:
##########
@@ -96,6 +101,35 @@ protected RunAutomaton(Automaton a, int alphabetSize) {
}
}
+ /** Detect whether this state can accept everything(all remaining suffixes). */
+ private boolean detectMatchAllSuffix(int state) {
+ assert automaton.isAccept(state);
+ Transition transition = new Transition();
+ int numTransitions = automaton.getNumTransitions(state);
+ // Apply to PrefixQuery, TermRangeQuery, custom binary Automata.
+ if (numTransitions == 1) {
+ automaton.getTransition(state, 0, transition);
+ if (transition.dest == state && transition.min == 0 && transition.max == alphabetSize - 1) {
+ return true;
+ }
+ }
+
+ // Apply to RegexpQuery, WildcardQuery.
+ // TODO: Is it enough just check last transition is [0, 127]?.
+ for (int i = 0; i < numTransitions; i++) {
+ automaton.getTransition(state, i, transition);
+ if (transition.min == 0 && transition.max == 127) {
+ if (transition.dest == state) {
+ return true;
+ } else if (automaton.isAccept(transition.dest)) {
+ // recurse
+ return detectMatchAllSuffix(transition.dest);
+ }
+ }
+ }
+ return false;
Review Comment:
That is incorrect. After UTF32ToUTF8 conversion, transitions intentionally
have gaps for illegal UTF-8 byte sequences (e.g. bytes [128, 193], or the
surrogate range). Such sequences never appear in indexed terms, so rejecting a
state because of these gaps would incorrectly disqualify valid matchAllSuffix states.
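
To make the gap concrete, here is a minimal standalone sketch (plain JDK, no Lucene classes). It encodes every Unicode scalar value as UTF-8 and records which byte values ever occur as the first byte. The class and method names (`Utf8LeadByteGaps`, `leadBytesInUse`, `unusedRanges`) are hypothetical and exist only for this illustration. It covers only the lead-byte case and is not how `UTF32ToUTF8` itself builds its transitions.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Lucene.
class Utf8LeadByteGaps {

  /** seen[b] is true iff byte value b starts the UTF-8 encoding of some scalar value. */
  static boolean[] leadBytesInUse() {
    boolean[] seen = new boolean[256];
    for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
      if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
        continue; // surrogates are not scalar values and have no legal UTF-8 encoding
      }
      byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
      seen[utf8[0] & 0xFF] = true;
    }
    return seen;
  }

  /** Formats the maximal runs of byte values that never occur as a lead byte. */
  static String unusedRanges(boolean[] seen) {
    StringBuilder sb = new StringBuilder();
    int b = 0;
    while (b < seen.length) {
      if (seen[b]) {
        b++;
        continue;
      }
      int start = b;
      while (b < seen.length && !seen[b]) {
        b++;
      }
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append('[').append(start).append(", ").append(b - 1).append(']');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(unusedRanges(leadBytesInUse()));
  }
}
```

A state whose outgoing transitions cover every lead byte that can actually occur would still show these holes, which is why a check requiring one contiguous [0, alphabetSize - 1] transition is too strict after the conversion.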
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]