Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

via GitHub Mon, 03 Feb 2025 12:00:21 -0800


rmuir commented on code in PR #14192:
URL: https://github.com/apache/lucene/pull/14192#discussion_r1939944485



##########
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##########
@@ -424,6 +426,46 @@ public enum Kind {
   /** Allows case insensitive matching of ASCII characters. */
   public static final int ASCII_CASE_INSENSITIVE = 0x0100;
 
+  /**
+   * Allows case-insensitive matching of most Unicode characters.
+   *
+   * <p>In general the attempt is to reach parity with {@link java.util.regex.Pattern}
+   * Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags when doing a case-insensitive match.
+   * This means characters like those representing the Greek symbol sigma (Σ, σ, ς) will all match
+   * one another
+   *
+   * <p>Some Unicode characters are difficult to correctly decode casing. In some cases Java's
+   * String class correctly handles decoding these but Java's {@link java.util.regex.Pattern} class
+   * does not. Again to keep parity and for performance reasons we are maintaining consistency with
+   * {@link java.util.regex.Pattern}. There are three known special classes of these characters we
+   * term unstable:
+   *
+   * <ul>
+   *   <li>1. the set of characters whose casing matches across multiple characters such as the
+   *       Greek sigma character mentioned above (Σ, σ, ς); we support these; notably some of these
+   *       characters fall into the ASCII range and so will behave differently when this flag is
+   *       enabled
+   *   <li>2. the set of characters that are neither in an upper nor lower case stable state and can
+   *       be both uppercased and lowercased from their current code point such as ǅ which when
+   *       uppercased produces Ǆ and when lowercased produces ǆ; we support these
+   *   <li>3. the set of characters that transition into or out of the Basic Multilingual Plane
+   *       (BMP). For performance reasons we ignore characters that transition for now, which is
+   *       consistent with {@link java.util.regex.Pattern}
+   * </ul>
+   *
+   * <p>Sometimes these classes of character will overlap; if a character is in both class 3 and any
+   * other case listed above it is ignored for the characters that transition; this is consistent
+   * with {@link java.util.regex.Pattern}. For instance: this character ῼ will match it's lowercase
+   * form ῳ in the BMP but not it's uppercase form outside the BMP: ΩΙ
+   *
+   * <p>These are the set of known characters that transition in or out of the BMP when cased such
+   * as ﬗ (code point: 64279 with 2 bytes) which when uppercased produces ՄԽ (code points: 1348 1341
+   * with 4 bytes) and are therefore ignored: 223, 304, 329, 496, 912, 944, 1415, 7830-7834, 8016,
+   * 8018, 8020, 8022, 8064-8111, 8114-8116, 8118, 8119, 8124, 8130-8132, 8134, 8135, 8140, 8146,
+   * 8147, 8150, 8151, 8162-8164, 8166-8180, 8182, 8183, 8188, 64256-64262, 64275-64279
+   */

Review Comment:
   I don't understand this comment. Where is the "in or out of BMP" transitioning?
   
   ﬗ is U+FB17 `ARMENIAN SMALL LIGATURE MEN XEH`
   ՄԽ is U+0544 `ARMENIAN CAPITAL LETTER MEN` followed by U+053D `ARMENIAN CAPITAL LETTER XEH`
   
   All of these are in the BMP, so we need a better explanation of what this list is actually about, since it isn't that.
   
   Using hex would help when reading it, but even with the decimal numbers you can see that nothing in this comment is outside of the BMP: every code point is below 65536 (U+FFFF is the last BMP code point).
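
The point above can be checked with a few lines of plain JDK code. This is only a rough sketch: the class name `BmpCheck` and its helper methods are made up for illustration and are not part of the PR or of Lucene. It prints the code points in hex, as suggested, and uses `Character.isBmpCodePoint` to confirm that both the ligature and its uppercase form stay inside the BMP.

```java
import java.util.Locale;

// Hypothetical helper class, for illustration only (not part of the PR).
public class BmpCheck {

    // Formats every code point of s as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // True if every code point of s lies in the Basic Multilingual Plane.
    static boolean allInBmp(String s) {
        return s.codePoints().allMatch(Character::isBmpCodePoint);
    }

    public static void main(String[] args) {
        String ligature = "\uFB17"; // ARMENIAN SMALL LIGATURE MEN XEH
        String upper = ligature.toUpperCase(Locale.ROOT);

        System.out.println(codePoints(ligature) + " -> " + codePoints(upper));
        System.out.println("all in BMP: " + (allInBmp(ligature) && allInBmp(upper)));
        // The string grows from one char to two, but never leaves the BMP.
        System.out.println("length: " + ligature.length() + " -> " + upper.length());
        // Largest decimal value in the Javadoc list, still a BMP code point.
        System.out.println("64279 in BMP: " + Character.isBmpCodePoint(64279));
    }
}
```

What does change when uppercasing is the length of the string (one code point becomes two), which may be closer to what the Javadoc is trying to describe.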
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


