Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

via GitHub Tue, 04 Feb 2025 15:05:06 -0800


rmuir commented on code in PR #14192:
URL: https://github.com/apache/lucene/pull/14192#discussion_r1942013537



##########
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##########
@@ -696,17 +896,52 @@ private Automaton toAutomaton(
     return a;
   }
 
-  private Automaton toCaseInsensitiveChar(int codepoint) {
-    Automaton case1 = Automata.makeChar(codepoint);
-    // For now we only work with ASCII characters
-    if (codepoint > 128) {
-      return case1;
+  /**
+   * This function handles both ASCII and the remainder of the Unicode spec for generating
+   * case-insensitive alternatives. Specifically for Unicode some special handling is required
+   * particularly to ensure behavior is at parity with other regex engines like {@link
+   * java.util.regex.Pattern}
+   *
+   * <p>See the {@link #UNICODE_CASE_INSENSITIVE} flag for details on the set of known unstable
+   * alternative casings within the Unicode spec.
+   *
+   * @param codepoint the Character code point to encode as an Automaton
+   * @return the set of Automaton that represent the original code point and it's variants
+   */
+  private Automaton toCaseInsensitiveChar(
+      int codepoint, boolean isAsciiInsensitive, boolean isUnicodeInsensitive) {
+    Automaton result;
+    if (isAsciiInsensitive || isUnicodeInsensitive) {
+      if (isUnicodeInsensitive) {
+        int[] altCodepoints = unstableUnicodeCharacters.get(codepoint);
+        if (altCodepoints != null) {
+          List<Automaton> altAutomaton = new ArrayList<>(altCodepoints.length);
+          for (int i = 0; i < altCodepoints.length; i++) {
+            altAutomaton.add(Automata.makeChar(altCodepoints[i]));
+          }
+          altAutomaton.add(Automata.makeChar(codepoint));
+          result = Operations.union(altAutomaton);

Review Comment:
   Yeah, that PR does not fix the hard problem (making `union` and
`concatenate` faster in general), but it can fix your problem.
   
   We could test this, but I'd guess that this PR should be very fast, and we
should consider deprecating the previous ASCII option and just calling this
one `CASE_INSENSITIVE`. We can make the argument that it is just as fast as or
faster than the previous ASCII option, and we would end up with a cleaner API
and less maintenance.
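
For readers following the hunk above, here is a minimal sketch of the idea it illustrates: collect every case variant of a code point, then turn each one into a single-character automaton and union them (the `Automata.makeChar` / `Operations.union` step in the diff). This is not the PR's implementation. The class name `CaseVariantsSketch`, the method `variants`, and the one-entry `EXTRA_ALTERNATIVES` table are hypothetical stand-ins, and using `Character.toLowerCase`/`toUpperCase` for the general case is an assumption. The sketch avoids Lucene classes so it runs on the JDK alone.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class CaseVariantsSketch {
  // Hypothetical stand-in for the PR's unstableUnicodeCharacters table.
  // Single illustrative entry: 'k' also matches U+212A (KELVIN SIGN),
  // which lowercases to 'k' but is not reachable from 'k' via toUpperCase.
  static final Map<Integer, int[]> EXTRA_ALTERNATIVES =
      Map.of((int) 'k', new int[] {0x212A});

  // Returns the code points that should match `codepoint` under the given flags.
  // In Lucene, each entry would become Automata.makeChar(cp), combined by union.
  static SortedSet<Integer> variants(
      int codepoint, boolean asciiInsensitive, boolean unicodeInsensitive) {
    SortedSet<Integer> result = new TreeSet<>();
    result.add(codepoint);
    if (unicodeInsensitive) {
      // Assumption: simple one-to-one case mappings from the JDK.
      result.add(Character.toLowerCase(codepoint));
      result.add(Character.toUpperCase(codepoint));
      int[] extras = EXTRA_ALTERNATIVES.get(codepoint);
      if (extras != null) {
        for (int alt : extras) {
          result.add(alt);
        }
      }
    } else if (asciiInsensitive && codepoint < 128) {
      // ASCII-only mode: non-ASCII code points keep just themselves.
      result.add(Character.toLowerCase(codepoint));
      result.add(Character.toUpperCase(codepoint));
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(variants('k', true, false));
    System.out.println(variants('k', false, true));
  }
}
```

In this sketch the Unicode mode always yields a superset of the ASCII mode's result, which is the property that would make a single `CASE_INSENSITIVE` flag a plausible replacement for the ASCII-only one.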



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

