Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

via GitHub Mon, 03 Feb 2025 13:07:02 -0800


john-wagster commented on code in PR #14192:
URL: https://github.com/apache/lucene/pull/14192#discussion_r1940023844



##########
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##########
@@ -424,6 +426,46 @@ public enum Kind {
   /** Allows case insensitive matching of ASCII characters. */
   public static final int ASCII_CASE_INSENSITIVE = 0x0100;
 
+  /**
+   * Allows case-insensitive matching of most Unicode characters.
+   *
+   * <p>In general the attempt is to reach parity with {@link java.util.regex.Pattern}
+   * Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags when doing a case-insensitive match.
+   * This means characters like those representing the Greek symbol sigma (Σ, σ, ς) will all match
+   * one another
+   *
+   * <p>Some Unicode characters are difficult to correctly decode casing. In some cases Java's
+   * String class correctly handles decoding these but Java's {@link java.util.regex.Pattern} class
+   * does not. Again to keep parity and for performance reasons we are maintaining consistency with
+   * {@link java.util.regex.Pattern}. There are three known special classes of these characters we
+   * term unstable:
+   *
+   * <ul>
+   *   <li>1. the set of characters whose casing matches across multiple characters such as the
+   *       Greek sigma character mentioned above (Σ, σ, ς); we support these; notably some of these
+   *       characters fall into the ASCII range and so will behave differently when this flag is
+   *       enabled
+   *   <li>2. the set of characters that are neither in an upper nor lower case stable state and can
+   *       be both uppercased and lowercased from their current code point such as ǅ which when
+   *       uppercased produces Ǆ and when lowercased produces ǆ; we support these
+   *   <li>3. the set of characters that transition into or out of the Basic Multilingual Plane
+   *       (BMP). For performance reasons we ignore characters that transition for now, which is
+   *       consistent with {@link java.util.regex.Pattern}
+   * </ul>
+   *
+   * <p>Sometimes these classes of character will overlap; if a character is in both class 3 and any
+   * other case listed above it is ignored for the characters that transition; this is consistent
+   * with {@link java.util.regex.Pattern}. For instance: this character ῼ will match it's lowercase
+   * form ῳ in the BMP but not it's uppercase form outside the BMP: ΩΙ
+   *
+   * <p>These are the set of known characters that transition in or out of the BMP when cased such
+   * as ﬗ (code point: 64279 with 2 bytes) which when uppercased produces ՄԽ (code points: 1348 1341
+   * with 4 bytes) and are therefore ignored: 223, 304, 329, 496, 912, 944, 1415, 7830-7834, 8016,
+   * 8018, 8020, 8022, 8064-8111, 8114-8116, 8118, 8119, 8124, 8130-8132, 8134, 8135, 8140, 8146,
+   * 8147, 8150, 8151, 8162-8164, 8166-8180, 8182, 8183, 8188, 64256-64262, 64275-64279
+   */

Review Comment:
   You are right; I poked around a bit more to better understand what's
happening here.  I was debugging through the Pattern class to see how this is
handled there and naively assumed the uppercased form was a single character,
but it is actually being uppercased into two characters within the BMP.  Let me
go back and double-check these characters, then revise the comment to state
more generically that we follow Pattern's behavior of not matching when casing
goes from a single code point to multiple code points, and remove the
references to the BMP.
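
   For anyone following along, the behavior described above can be reproduced
with a small standalone check. This is only a sketch and not code from the PR:
the class name `CaseExpansionDemo` and both helper methods are hypothetical,
and the two sample code points (223 and 64279) are taken from the list in the
Javadoc under review.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class CaseExpansionDemo {

  /** Returns true when String's full case mapping turns one code point into several. */
  static boolean expandsWhenUppercased(int codePoint) {
    String upper = new String(Character.toChars(codePoint)).toUpperCase(Locale.ROOT);
    return upper.codePointCount(0, upper.length()) > 1;
  }

  /**
   * Returns true when a case-insensitive Pattern built from the code point matches the
   * uppercase form that String.toUpperCase produces for it.
   */
  static boolean patternMatchesUppercase(int codePoint) {
    String original = new String(Character.toChars(codePoint));
    String upper = original.toUpperCase(Locale.ROOT);
    Pattern pattern =
        Pattern.compile(
            Pattern.quote(original), Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    return pattern.matcher(upper).matches();
  }

  public static void main(String[] args) {
    // Sample code points from the Javadoc list: 223 (ß) and 64279 (ﬗ).
    int[] samples = {223, 64279};
    for (int cp : samples) {
      System.out.printf(
          "code point %d: expands=%b, Pattern matches uppercase=%b%n",
          cp, expandsWhenUppercased(cp), patternMatchesUppercase(cp));
    }
  }
}
```

   Both samples expand to two code points that are still inside the BMP, and a
single-character Pattern cannot match the two-character result, which is the
one-to-many case the revised comment should describe.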



##########
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##########
@@ -436,6 +478,160 @@ public enum Kind {
    */
   @Deprecated public static final int DEPRECATED_COMPLEMENT = 0x10000;
 
+  /**
+   * See {@link #UNICODE_CASE_INSENSITIVE} for more details on the set of known unstable alternative
+   * casings
+   */
+  static final Map<Integer, int[]> unstableUnicodeCharacters =
+      Map.ofEntries(
+          // these are the set of characters whose casing matches across multiple characters
+          entry(181, new int[] {924, 956}),

Review Comment:
   will do
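
   As a reading aid for the `entry(181, new int[] {924, 956})` line quoted
above: code point 181 is the micro sign (µ), whose simple uppercase is 924 (Μ),
which in turn lowercases to 956 (μ). The sketch below derives those two values
with `Character`'s per-code-point mappings. It is an illustration only; the
class `UnstableCasingDemo` and the helper `alternativeCasings` are hypothetical
names, and this is not claimed to be how the PR builds its table.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class UnstableCasingDemo {

  /**
   * Returns the other code points reachable from the given one by uppercasing it and then
   * lowercasing that result, using Character's simple (one-to-one) case mappings.
   */
  static int[] alternativeCasings(int codePoint) {
    int upper = Character.toUpperCase(codePoint);
    int lowerOfUpper = Character.toLowerCase(upper);
    return IntStream.of(upper, lowerOfUpper)
        .filter(cp -> cp != codePoint) // drop the starting code point itself
        .distinct()
        .sorted()
        .toArray();
  }

  public static void main(String[] args) {
    // 181 is the micro sign; its alternatives are Greek capital and small mu.
    System.out.println(Arrays.toString(alternativeCasings(181)));
  }
}
```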



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


