john-wagster commented on code in PR #14192:
URL: https://github.com/apache/lucene/pull/14192#discussion_r1941963110


##########
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##########
@@ -696,17 +896,52 @@ private Automaton toAutomaton(
     return a;
   }
 
-  private Automaton toCaseInsensitiveChar(int codepoint) {
-    Automaton case1 = Automata.makeChar(codepoint);
-    // For now we only work with ASCII characters
-    if (codepoint > 128) {
-      return case1;
+  /**
+   * This function handles both ASCII and the remainder of the Unicode spec for generating
+   * case-insensitive alternatives. Specifically for Unicode some special handling is required
+   * particularly to ensure behavior is at parity with other regex engines like {@link
+   * java.util.regex.Pattern}
+   *
+   * <p>See the {@link #UNICODE_CASE_INSENSITIVE} flag for details on the set of known unstable
+   * alternative casings within the Unicode spec.
+   *
+   * @param codepoint the Character code point to encode as an Automaton
+   * @return the set of Automaton that represent the original code point and its variants
+   */
+  private Automaton toCaseInsensitiveChar(
+      int codepoint, boolean isAsciiInsensitive, boolean isUnicodeInsensitive) {
+    Automaton result;
+    if (isAsciiInsensitive || isUnicodeInsensitive) {
+      if (isUnicodeInsensitive) {
+        int[] altCodepoints = unstableUnicodeCharacters.get(codepoint);
+        if (altCodepoints != null) {
+          List<Automaton> altAutomaton = new ArrayList<>(altCodepoints.length);
+          for (int i = 0; i < altCodepoints.length; i++) {
+            altAutomaton.add(Automata.makeChar(altCodepoints[i]));
+          }
+          altAutomaton.add(Automata.makeChar(codepoint));
+          result = Operations.union(altAutomaton);

Review Comment:
   Awesome; I took a brief look and it looks good to me. Part of the reason I 
built these up in a List is a recent PR from @mayya-sharipova, where she 
found performance issues with unions: 
https://github.com/apache/lucene/pull/14169
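
   To make the List-then-union reasoning concrete, here is a minimal toy sketch. It is not Lucene code and does not reproduce the measurements from the linked PR. It assumes, purely for illustration, that a union copies its inputs into a fresh result, and models each "automaton" as a plain list of code points. The class `UnionCostSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: an "automaton" here is just a list of code points, not a Lucene Automaton. */
final class UnionCostSketch {
  /** Number of elements copied by the toy union operations since the last reset. */
  static long copies = 0;

  /** Toy two-argument union: builds a fresh list by copying both inputs. */
  static List<Integer> unionPair(List<Integer> a, List<Integer> b) {
    List<Integer> out = new ArrayList<>(a.size() + b.size());
    for (int cp : a) {
      out.add(cp);
      copies++;
    }
    for (int cp : b) {
      out.add(cp);
      copies++;
    }
    return out;
  }

  /** Toy batch union: copies every input element exactly once. */
  static List<Integer> unionAll(List<List<Integer>> parts) {
    List<Integer> out = new ArrayList<>();
    for (List<Integer> part : parts) {
      for (int cp : part) {
        out.add(cp);
        copies++;
      }
    }
    return out;
  }

  /** Accumulates the result with repeated pairwise unions and returns the copy count. */
  static long pairwiseCopies(int[] codepoints) {
    copies = 0;
    if (codepoints.length == 0) {
      return 0;
    }
    List<Integer> result = List.of(codepoints[0]);
    for (int i = 1; i < codepoints.length; i++) {
      // Each step re-copies everything accumulated so far.
      result = unionPair(result, List.of(codepoints[i]));
    }
    return copies;
  }

  /** Collects all parts into a list first, unions once, and returns the copy count. */
  static long batchCopies(int[] codepoints) {
    copies = 0;
    List<List<Integer>> parts = new ArrayList<>(codepoints.length);
    for (int cp : codepoints) {
      parts.add(List.of(cp));
    }
    unionAll(parts);
    return copies;
  }

  public static void main(String[] args) {
    // The three sigma forms: capital, small, and final small sigma.
    int[] sigma = {0x3A3, 0x3C3, 0x3C2};
    System.out.println("pairwise copies: " + pairwiseCopies(sigma));
    System.out.println("batch copies:    " + batchCopies(sigma));
  }
}
```

   Under that copying assumption, repeated pairwise unions re-copy the accumulated result at every step, so the work grows quadratically with the number of alternatives, while a single union over the collected list touches each element once. Whether this matches the actual cause of the slowdown is covered in the linked PR, not here.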



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

