rmuir commented on code in PR #14192:
URL: https://github.com/apache/lucene/pull/14192#discussion_r1939944485
##########
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##########
@@ -424,6 +426,46 @@ public enum Kind {
   /** Allows case insensitive matching of ASCII characters. */
   public static final int ASCII_CASE_INSENSITIVE = 0x0100;
 
+  /**
+   * Allows case-insensitive matching of most Unicode characters.
+   *
+   * <p>In general the attempt is to reach parity with {@link java.util.regex.Pattern}
+   * Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags when doing a case-insensitive match.
+   * This means characters like those representing the Greek symbol sigma (Σ, σ, ς) will all match
+   * one another
+   *
+   * <p>Some Unicode characters are difficult to correctly decode casing. In some cases Java's
+   * String class correctly handles decoding these but Java's {@link java.util.regex.Pattern} class
+   * does not. Again to keep parity and for performance reasons we are maintaining consistency with
+   * {@link java.util.regex.Pattern}. There are three known special classes of these characters we
+   * term unstable:
+   *
+   * <ul>
+   *   <li>1. the set of characters whose casing matches across multiple characters such as the
+   *       Greek sigma character mentioned above (Σ, σ, ς); we support these; notably some of these
+   *       characters fall into the ASCII range and so will behave differently when this flag is
+   *       enabled
+   *   <li>2. the set of characters that are neither in an upper nor lower case stable state and can
+   *       be both uppercased and lowercased from their current code point such as Dž which when
+   *       uppercased produces DŽ and when lowercased produces dž; we support these
+   *   <li>3. the set of characters that transition into or out of the Basic Multilingual Plane
+   *       (BMP). For performance reasons we ignore characters that transition for now, which is
+   *       consistent with {@link java.util.regex.Pattern}
+   * </ul>
+   *
+   * <p>Sometimes these classes of character will overlap; if a character is in both class 3 and any
+   * other case listed above it is ignored for the characters that transition; this is consistent
+   * with {@link java.util.regex.Pattern}. For instance: this character ῼ will match it's lowercase
+   * form ῳ in the BMP but not it's uppercase form outside the BMP: ΩΙ
+   *
+   * <p>These are the set of known characters that transition in or out of the BMP when cased such
+   * as ﬗ (code point: 64279 with 2 bytes) which when uppercased produces ՄԽ (code points: 1348 1341
+   * with 4 bytes) and are therefore ignored: 223, 304, 329, 496, 912, 944, 1415, 7830-7834, 8016,
+   * 8018, 8020, 8022, 8064-8111, 8114-8116, 8118, 8119, 8124, 8130-8132, 8134, 8135, 8140, 8146,
+   * 8147, 8150, 8151, 8162-8164, 8166-8180, 8182, 8183, 8188, 64256-64262, 64275-64279
+   */

Review Comment:
   I don't understand this comment. Where is the "in or out of the BMP" transition?

   ﬗ is U+FB17 `ARMENIAN SMALL LIGATURE MEN XEH`.
   ՄԽ is U+0544 `ARMENIAN CAPITAL LETTER MEN` followed by U+053D `ARMENIAN CAPITAL LETTER XEH`.

   All of this is in the BMP, so we need a better explanation of what this is actually about, since it isn't a BMP transition.

   Using hex would help when reading it, but even with the decimal numbers you can see that nothing in this comment is outside of the BMP: every value is less than 64k.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
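The code point claims in the review comment can be checked directly. What follows is a minimal sketch, assuming a plain JDK (11 or later) and nothing from Lucene; the class name `BmpCheck` and its helpers `allInBmp` and `hex` are hypothetical names introduced only for illustration.

```java
import java.util.Locale;

// Hypothetical helper class, not part of Lucene or the PR under review.
final class BmpCheck {

    /** Returns true if every code point of s lies in the Basic Multilingual Plane. */
    static boolean allInBmp(String s) {
        return s.codePoints().allMatch(Character::isBmpCodePoint);
    }

    /** Formats each code point of s as U+XXXX, separated by single spaces. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+FB17 ARMENIAN SMALL LIGATURE MEN XEH (decimal 64279 in the doc comment)
        String ligature = "\uFB17";
        // Locale.ROOT keeps the mapping independent of the default locale.
        String upper = ligature.toUpperCase(Locale.ROOT);

        System.out.println(hex(ligature) + " -> " + hex(upper));
        System.out.println("code points: " + ligature.codePointCount(0, ligature.length())
                + " -> " + upper.codePointCount(0, upper.length()));
        System.out.println("all in BMP: " + (allInBmp(ligature) && allInBmp(upper)));
    }
}
```

Run it as a single file with `java BmpCheck.java`. The uppercase form comes back as two code points that are both inside the BMP, which suggests the doc comment may really be describing one-to-many case mappings rather than a move in or out of the BMP. That reading is an inference from the output, not something the PR states.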