[GitHub] [lucene-solr] jpountz commented on a change in pull request #1541: RegExp - add case insensitive matching option

GitBox Tue, 16 Jun 2020 10:00:16 -0700


jpountz commented on a change in pull request #1541:
URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r440900284




##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -489,6 +497,19 @@ public RegExp(String s) throws IllegalArgumentException {
     this(s, ALL);
   }
   
+  /**
+   * Constructs new <code>RegExp</code> from a string. Same as
+   * <code>RegExp(s, ALL)</code>.
+   * 
+   * @param s regexp string
+   * @param caseSensitive case sensitive matching
+   * @exception IllegalArgumentException if an error occurred while parsing the
+   *              regular expression
+   */
+  public RegExp(String s, boolean caseSensitive) throws 
IllegalArgumentException {
+    this(s, ALL, caseSensitive);
+  }  

Review comment:
       same here, maybe it's fine to require passing syntax flags when you want 
to configure case sensitivity?

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -499,10 +520,30 @@ public RegExp(String s) throws IllegalArgumentException {
    *              regular expression
    */
   public RegExp(String s, int syntax_flags) throws IllegalArgumentException {
+    this(s, syntax_flags, true);
+  }
+  /**
+   * Constructs new <code>RegExp</code> from a string.
+   * 
+   * @param s regexp string
+   * @param syntax_flags boolean 'or' of optional syntax constructs to be
+   *          enabled
+   * @param caseSensitive case sensitive matching
+   * @exception IllegalArgumentException if an error occurred while parsing the
+   *              regular expression
+   */
+  public RegExp(String s, int syntax_flags, boolean caseSensitive) throws 
IllegalArgumentException {    
     originalString = s;
-    flags = syntax_flags;
+    // Trim any bits unrelated to syntax flags
+    syntax_flags  = syntax_flags & 0xff;
+    if (caseSensitive) {
+      flags = syntax_flags;
+    } else {      
+      // Add in the case-insensitive setting
+      flags = syntax_flags  | UNICODE_CASE_INSENSITIVE;

Review comment:
       ```suggestion
         flags = syntax_flags | UNICODE_CASE_INSENSITIVE;
   ```

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -743,6 +792,30 @@ private Automaton 
toAutomatonInternal(Map<String,Automaton> automata,
     }
     return a;
   }
+  private Automaton toCaseInsensitiveChar(int codepoint, int 
maxDeterminizedStates) {
+    Automaton case1 = Automata.makeChar(codepoint);
+    int altCase = Character.isLowerCase(codepoint) ? 
Character.toUpperCase(codepoint) : Character.toLowerCase(codepoint);
+    Automaton result;
+    if (altCase != codepoint) {
+      result = Operations.union(case1, Automata.makeChar(altCase));
+      result = MinimizationOperations.minimize(result, maxDeterminizedStates); 
         
+    } else {
+      result = case1;                      
+    }          
+    return result;
+  }
+  
+  private Automaton toCaseInsensitiveString(int maxDeterminizedStates) {
+    List<Automaton> list = new ArrayList<>();
+    s.codePoints().forEach(
+        p -> {
+          list.add(toCaseInsensitiveChar(p, maxDeterminizedStates));
+        }
+    );

Review comment:
       I'd rather like a regular for loop, this is a bit abusing lambdas to my 
taste. :)

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -743,6 +792,30 @@ private Automaton 
toAutomatonInternal(Map<String,Automaton> automata,
     }
     return a;
   }
+  private Automaton toCaseInsensitiveChar(int codepoint, int 
maxDeterminizedStates) {
+    Automaton case1 = Automata.makeChar(codepoint);
+    int altCase = Character.isLowerCase(codepoint) ? 
Character.toUpperCase(codepoint) : Character.toLowerCase(codepoint);
+    Automaton result;
+    if (altCase != codepoint) {
+      result = Operations.union(case1, Automata.makeChar(altCase));
+      result = MinimizationOperations.minimize(result, maxDeterminizedStates); 
         
+    } else {
+      result = case1;                      
+    }          
+    return result;
+  }

Review comment:
       I think that this is incorrect as there is no 1:1 mapping between 
lowercase and uppercase letters, for instance `ς` and `σ` both have `Σ` as 
their uppercase variant. And if someone uses `Σ` in their regexes, `ς` wouldn't 
match as `toLowerCase(Σ)` returns `σ`.
   
   Should we make this only about ASCII for now, like Java's Pattern class? 
https://docs.oracle.com/en/java/javase/13/docs/api/java.base/java/util/regex/Pattern.html#CASE_INSENSITIVE
 We could add support for full Unicode later but this doesn't look like a low 
hanging fruit to me?

##########
File path: lucene/core/src/java/org/apache/lucene/search/RegexpQuery.java
##########
@@ -68,6 +68,19 @@ public RegexpQuery(Term term) {
     this(term, RegExp.ALL);
   }
   
+  /**
+   * Constructs a query for terms matching <code>term</code>.
+   * <p>
+   * By default, all regular expression features are enabled.
+   * </p>
+   * 
+   * @param term regular expression.
+   * @param caseSensitive set to false for case insensitive matching 
+   */
+  public RegexpQuery(Term term, boolean caseSensitive) {
+    this(term, RegExp.ALL, caseSensitive);
+  }  

Review comment:
       In order not to have a combinatorial explosion of the number of ctors, I 
think we could consider dropping this one: I think it's fine to require users 
to provide flags if they also want to configure case sensitivity?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene-solr] jpountz commented on a change in pull request #1541: RegExp - add case insensitive matching option

Reply via email to