Copilot commented on code in PR #13072:
URL: https://github.com/apache/lucene/pull/13072#discussion_r3099052173
##########
lucene/core/src/java/org/apache/lucene/util/automaton/ByteRunnable.java:
##########
@@ -37,6 +37,14 @@ public interface ByteRunnable {
*/
boolean isAccept(int state);
+ /**
+ * Returns true if this state can accept all remaining suffixes from now on.
+ *
+ * @param state the state
+ * @return whether this state can accept all remaining suffixes.
+ */
+ boolean isMatchAllSuffix(int state);
Review Comment:
`ByteRunnable` is a public interface, and adding a new abstract method
(`isMatchAllSuffix`) is a binary/source breaking change for any external
implementations. To preserve compatibility, make this a `default` method (e.g.,
returning `false`) so existing implementors continue to work without
recompilation, while allowing optimized implementations to override it.
```suggestion
default boolean isMatchAllSuffix(int state) {
return false;
}
```
##########
lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java:
##########
@@ -96,6 +101,35 @@ protected RunAutomaton(Automaton a, int alphabetSize) {
}
}
+ /** Detect whether this state can accept everything(all remaining suffixes).
*/
+ private boolean detectMatchAllSuffix(int state) {
+ assert automaton.isAccept(state);
+ Transition transition = new Transition();
+ int numTransitions = automaton.getNumTransitions(state);
+ // Apply to PrefixQuery, TermRangeQuery, custom binary Automata.
+ if (numTransitions == 1) {
+ automaton.getTransition(state, 0, transition);
+ if (transition.dest == state && transition.min == 0 && transition.max ==
alphabetSize - 1) {
+ return true;
+ }
+ }
+
+ // Apply to RegexpQuery, WildcardQuery.
+ // TODO: Is it enough just check last transition is [0, 127]?.
+ for (int i = 0; i < numTransitions; i++) {
+ automaton.getTransition(state, i, transition);
+ if (transition.min == 0 && transition.max == 127) {
+ if (transition.dest == state) {
+ return true;
+ } else if (automaton.isAccept(transition.dest)) {
+ // recurse
+ return detectMatchAllSuffix(transition.dest);
+ }
+ }
+ }
+ return false;
Review Comment:
`detectMatchAllSuffix` currently hard-codes an ASCII range check
(`transition.max == 127`) and returns true as soon as it finds a `[0,127]`
transition that loops (or reaches another accept state). This is not sufficient
to prove the state accepts *all* remaining suffixes for a DFA over
`alphabetSize` symbols, and can incorrectly mark states as `matchAllSuffix`,
causing `IntersectTermsEnum` to return terms that do not actually match the
automaton. Consider computing `matchAllSuffix` as a fixpoint: a state qualifies
iff it is accept, its outgoing transitions cover the full alphabet, and every
destination is also `matchAllSuffix` (with cycle detection / memoization to
avoid infinite recursion on cyclic graphs).
```suggestion
byte[] memo = new byte[size];
return detectMatchAllSuffix(state, memo);
}
/**
* Memoized fixpoint computation for {@code matchAllSuffix}.
*
* <p>A state matches all remaining suffixes iff it is accept, its
outgoing transitions cover the
* full alphabet with no gaps, and every destination state also matches
all remaining suffixes.
*
* <p>Memo values: 0 = unknown, 1 = true, 2 = false, 3 = visiting.
*/
private boolean detectMatchAllSuffix(int state, byte[] memo) {
if (memo[state] == 1) {
return true;
} else if (memo[state] == 2) {
return false;
} else if (memo[state] == 3) {
// We reached a cycle. Treat it as provisionally true and let the
caller finish validating
// full alphabet coverage for all states in the cycle.
return true;
}
if (automaton.isAccept(state) == false) {
memo[state] = 2;
return false;
}
memo[state] = 3;
Transition transition = new Transition();
int numTransitions = automaton.getNumTransitions(state);
int expectedMin = 0;
for (int i = 0; i < numTransitions; i++) {
automaton.getTransition(state, i, transition);
if (transition.min != expectedMin ||
detectMatchAllSuffix(transition.dest, memo) == false) {
memo[state] = 2;
return false;
}
expectedMin = transition.max + 1;
}
if (expectedMin != alphabetSize) {
memo[state] = 2;
return false;
}
memo[state] = 1;
return true;
```
##########
lucene/core/src/test/org/apache/lucene/search/TestAutomatonQuery.java:
##########
@@ -16,21 +16,19 @@
*/
package org.apache.lucene.search;
+import static org.apache.lucene.tests.util.TestUtil.alwaysPostingsFormat;
+import static org.apache.lucene.tests.util.TestUtil.getDefaultPostingsFormat;
import static
org.apache.lucene.util.automaton.Operations.DEFAULT_DETERMINIZE_WORK_LIMIT;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
+import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.MultiTerms;
-import org.apache.lucene.index.SingleTermsEnum;
-import org.apache.lucene.index.Term;
-import org.apache.lucene.index.Terms;
-import org.apache.lucene.index.TermsEnum;
+import org.apache.lucene.index.*;
Review Comment:
`import org.apache.lucene.index.*;` reduces clarity of dependencies in this
test and is inconsistent with most other tests in `org.apache.lucene.search`
which list explicit Index imports. Please switch back to explicit imports for
the specific types used in this file.
```suggestion
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
```
##########
lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java:
##########
@@ -105,6 +139,8 @@ public String toString() {
b.append("state ").append(i);
if (accept.get(i)) b.append(" [accept]:\n");
else b.append(" [reject]:\n");
+ if (matchAllSuffix.get(i)) b.append(" [matchAllSuffix]:\n");
+ else b.append(" [can not matchAllSuffix]:\n");
Review Comment:
The new `toString()` output uses the phrase "can not matchAllSuffix". If
this string is intended for human readability/debugging, "cannot" is the
standard spelling (or consider omitting the negative case entirely to keep
output compact).
```suggestion
else b.append(" [cannot matchAllSuffix]:\n");
```
##########
lucene/core/src/test/org/apache/lucene/search/TestPrefixQuery.java:
##########
@@ -16,15 +16,18 @@
*/
package org.apache.lucene.search;
+import static org.apache.lucene.tests.util.TestUtil.alwaysPostingsFormat;
+import static org.apache.lucene.tests.util.TestUtil.getDefaultPostingsFormat;
+
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
+import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.Term;
+import org.apache.lucene.index.*;
Review Comment:
`import org.apache.lucene.index.*;` is the only wildcard import in this test
package and makes it harder to see which Index classes are actually used.
Prefer explicit imports here for consistency/readability (e.g., as done in
`TestWildcardQuery` and `TestRegexpQuery`).
```suggestion
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]