rmuir commented on a change in pull request #225: URL: https://github.com/apache/lucene/pull/225#discussion_r762372861
########## File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java ########## @@ -614,16 +624,17 @@ public Automaton toAutomaton(AutomatonProvider automaton_provider, int determini */ public Automaton toAutomaton(Map<String, Automaton> automata, int determinizeWorkLimit) throws IllegalArgumentException, TooComplexToDeterminizeException { - return toAutomaton(automata, null, determinizeWorkLimit); + return toAutomaton(automata, null, determinizeWorkLimit, true); } private Automaton toAutomaton( Map<String, Automaton> automata, AutomatonProvider automaton_provider, - int determinizeWorkLimit) + int determinizeWorkLimit, + boolean buildDFA) Review comment: Similar to the other classes (AutomatonQuery, RunAutomaton, CompiledAutomaton), I don't think we should add booleans here. Instead, I'd rather remove `determinizeWorkLimit` and calls to `minimize()` everywhere from this thing. It is pretty crazy that it calls `minimize` at every "parsing step"!. I think `toAutomaton()` should just return an NFA, and if the caller wants to determinize or minimize it, that's up to them. There is an annoying twist, in that even with all these booleans for an NFA, determinize() still gets called if you use the "complement" operator. I'm not sure there is a way to implement this operator without exponential time. It is documented to be optional (although we enable it by default in our RegexpQuery), maybe we should seriously consider removing this operator? ########## File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java ########## @@ -614,16 +624,17 @@ public Automaton toAutomaton(AutomatonProvider automaton_provider, int determini */ public Automaton toAutomaton(Map<String, Automaton> automata, int determinizeWorkLimit) throws IllegalArgumentException, TooComplexToDeterminizeException { - return toAutomaton(automata, null, determinizeWorkLimit); + return toAutomaton(automata, null, determinizeWorkLimit, true); } private Automaton toAutomaton( Map<String, Automaton> automata, AutomatonProvider automaton_provider, - int determinizeWorkLimit) + int determinizeWorkLimit, + boolean buildDFA) Review comment: I've got a good practical solution to this, PR is coming soon. Then RegExp gets simple and callers can determinize/minimize if they want that. ########## File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java ########## @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.util.automaton; + +import java.util.Arrays; +import org.apache.lucene.util.IntsRef; +import org.apache.lucene.util.LuceneTestCase; + +public class TestNFARunAutomaton extends LuceneTestCase { + + public void testRandom() { + for (int i = 0; i < 100; i++) { + RegExp regExp = null; + while (regExp == null) { + try { + regExp = new RegExp(AutomatonTestUtil.randomRegexp(random())); + } catch (IllegalArgumentException e) { + ignoreException(e); Review comment: You can also annotate the `IllegalArgumentException` with `@SuppressWarnings("unused")`. as far as the random regexp, I think it already does what you want? One tricky thing is, the strings it generates are only valid if you then build regexp with `RegExp.NONE`. That is probably what causes confusion here. ########## File path: lucene/codecs/src/java/org/apache/lucene/codecs/memory/DirectPostingsFormat.java ########## @@ -962,15 +964,22 @@ public ImpactsEnum impacts(int flags) throws IOException { private int stateUpto; public DirectIntersectTermsEnum(CompiledAutomaton compiled, BytesRef startTerm) { - runAutomaton = compiled.runAutomaton; - compiledAutomaton = compiled; + if (compiled.nfaRunAutomaton != null) { + this.runAutomaton = compiled.nfaRunAutomaton; Review comment: I too would really prefer it if we can avoid the current "if". It starts to look like a "dance" in all the places where we do it. I don't understand about how it makes debugging easier, can't we just print automaton.isDeterministic() ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org