[GitHub] [lucene] rmuir commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

GitBox Thu, 09 Dec 2021 10:35:39 -0800


rmuir commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r762372861




##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -614,16 +624,17 @@ public Automaton toAutomaton(AutomatonProvider 
automaton_provider, int determini
    */
   public Automaton toAutomaton(Map<String, Automaton> automata, int 
determinizeWorkLimit)
       throws IllegalArgumentException, TooComplexToDeterminizeException {
-    return toAutomaton(automata, null, determinizeWorkLimit);
+    return toAutomaton(automata, null, determinizeWorkLimit, true);
   }
 
   private Automaton toAutomaton(
       Map<String, Automaton> automata,
       AutomatonProvider automaton_provider,
-      int determinizeWorkLimit)
+      int determinizeWorkLimit,
+      boolean buildDFA)

Review comment:
       Similar to the other classes (AutomatonQuery, RunAutomaton, 
CompiledAutomaton), I don't think we should add booleans here.
   
   Instead, I'd rather remove `determinizeWorkLimit` and calls to `minimize()` 
everywhere from this thing. It is pretty crazy that it calls `minimize` at 
every "parsing step"!.
   
   I think `toAutomaton()` should just return an NFA, and if the caller wants 
to determinize or minimize it, that's up to them.
   
   There is an annoying twist, in that even with all these booleans for an NFA, 
determinize() still gets called if you use the "complement" operator. I'm not 
sure there is a way to implement this operator without exponential time. It is 
documented to be optional (although we enable it by default in our 
RegexpQuery), maybe we should seriously consider removing this operator?

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -614,16 +624,17 @@ public Automaton toAutomaton(AutomatonProvider 
automaton_provider, int determini
    */
   public Automaton toAutomaton(Map<String, Automaton> automata, int 
determinizeWorkLimit)
       throws IllegalArgumentException, TooComplexToDeterminizeException {
-    return toAutomaton(automata, null, determinizeWorkLimit);
+    return toAutomaton(automata, null, determinizeWorkLimit, true);
   }
 
   private Automaton toAutomaton(
       Map<String, Automaton> automata,
       AutomatonProvider automaton_provider,
-      int determinizeWorkLimit)
+      int determinizeWorkLimit,
+      boolean buildDFA)

Review comment:
       I've got a good practical solution to this, PR is coming soon. Then 
RegExp gets simple and callers can determinize/minimize if they want that.

##########
File path: 
lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  public void testRandom() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);

Review comment:
       You can also annotate the `IllegalArgumentException` with 
`@SuppressWarnings("unused")`.
   
   as far as the random regexp, I think it already does what you want? One 
tricky thing is, the strings it generates are only valid if you then build 
regexp with `RegExp.NONE`. That is probably what causes confusion here.

##########
File path: 
lucene/codecs/src/java/org/apache/lucene/codecs/memory/DirectPostingsFormat.java
##########
@@ -962,15 +964,22 @@ public ImpactsEnum impacts(int flags) throws IOException {
       private int stateUpto;
 
       public DirectIntersectTermsEnum(CompiledAutomaton compiled, BytesRef 
startTerm) {
-        runAutomaton = compiled.runAutomaton;
-        compiledAutomaton = compiled;
+        if (compiled.nfaRunAutomaton != null) {
+          this.runAutomaton = compiled.nfaRunAutomaton;

Review comment:
       I too would really prefer it if we can avoid the current "if". It starts 
to look like a "dance" in all the places where we do it. I don't understand 
about how it makes debugging easier, can't we just print 
automaton.isDeterministic() ? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] rmuir commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Reply via email to