john-wagster commented on code in PR #14192:
URL: https://github.com/apache/lucene/pull/14192#discussion_r1940023844
##########
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##########
@@ -424,6 +426,46 @@ public enum Kind {
   /** Allows case insensitive matching of ASCII characters. */
   public static final int ASCII_CASE_INSENSITIVE = 0x0100;

+  /**
+   * Allows case-insensitive matching of most Unicode characters.
+   *
+   * <p>In general the attempt is to reach parity with {@link java.util.regex.Pattern}
+   * Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags when doing a case-insensitive match.
+   * This means characters like those representing the Greek symbol sigma (Σ, σ, ς) will all match
+   * one another
+   *
+   * <p>Some Unicode characters are difficult to correctly decode casing. In some cases Java's
+   * String class correctly handles decoding these but Java's {@link java.util.regex.Pattern} class
+   * does not. Again to keep parity and for performance reasons we are maintaining consistency with
+   * {@link java.util.regex.Pattern}. There are three known special classes of these characters we
+   * term unstable:
+   *
+   * <ul>
+   *   <li>1. the set of characters whose casing matches across multiple characters such as the
+   *       Greek sigma character mentioned above (Σ, σ, ς); we support these; notably some of these
+   *       characters fall into the ASCII range and so will behave differently when this flag is
+   *       enabled
+   *   <li>2. the set of characters that are neither in an upper nor lower case stable state and can
+   *       be both uppercased and lowercased from their current code point such as Dž which when
+   *       uppercased produces DŽ and when lowercased produces dž; we support these
+   *   <li>3. the set of characters that transition into or out of the Basic Multilingual Plane
+   *       (BMP). For performance reasons we ignore characters that transition for now, which is
+   *       consistent with {@link java.util.regex.Pattern}
+   * </ul>
+   *
+   * <p>Sometimes these classes of character will overlap; if a character is in both class 3 and any
+   * other case listed above it is ignored for the characters that transition; this is consistent
+   * with {@link java.util.regex.Pattern}. For instance: this character ῼ will match it's lowercase
+   * form ῳ in the BMP but not it's uppercase form outside the BMP: ΩΙ
+   *
+   * <p>These are the set of known characters that transition in or out of the BMP when cased such
+   * as ﬗ (code point: 64279 with 2 bytes) which when uppercased produces ՄԽ (code points: 1348 1341
+   * with 4 bytes) and are therefore ignored: 223, 304, 329, 496, 912, 944, 1415, 7830-7834, 8016,
+   * 8018, 8020, 8022, 8064-8111, 8114-8116, 8118, 8119, 8124, 8130-8132, 8134, 8135, 8140, 8146,
+   * 8147, 8150, 8151, 8162-8164, 8166-8180, 8182, 8183, 8188, 64256-64262, 64275-64279
+   */

Review Comment:
   You are right. I poked around a bit more to better understand what's happening here. I was debugging through the Pattern class to see how this is handled there and naively assumed the uppercased result was a single character, but it is actually being uppercased into two characters within the BMP. Let me go back and double-check these characters. At a minimum I will revise the comment to state more generically that we follow Pattern's behavior of not matching when casing maps a single code point to multiple code points, and remove the references to the BMP.
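
   For readers following the thread, here is a minimal sketch of the behavior being discussed. It is not taken from the PR: the class name `CaseFoldingSketch` and its helper methods are hypothetical, and only standard JDK calls (`String.toUpperCase`, `Character.toUpperCase`, `Character.isBmpCodePoint`, `java.util.regex.Pattern`) are used. It illustrates that the "ignored" characters expand from one code point to several when uppercased while staying inside the BMP, and that `Pattern` does not match across that expansion.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Lucene or the PR.
public class CaseFoldingSketch {

  /** Builds a string holding exactly one code point. */
  static String str(int codePoint) {
    return new String(Character.toChars(codePoint));
  }

  /** True when full (String-level) uppercasing maps one code point to several. */
  static boolean expandsWhenUppercased(int codePoint) {
    String upper = str(codePoint).toUpperCase(Locale.ROOT);
    return upper.codePointCount(0, upper.length()) > 1;
  }

  /** True when every code point of the string lies in the Basic Multilingual Plane. */
  static boolean staysInBmp(String s) {
    return s.codePoints().allMatch(Character::isBmpCodePoint);
  }

  /** Full-string match using the two Pattern flags named in the Javadoc. */
  static boolean patternMatches(String regex, String input) {
    int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
    return Pattern.compile(regex, flags).matcher(input).matches();
  }

  public static void main(String[] args) {
    // Class 1 from the Javadoc: the three sigma forms match one another.
    String smallSigma = str(0x03C3);
    System.out.println(patternMatches(smallSigma, str(0x03A3))); // capital sigma
    System.out.println(patternMatches(smallSigma, str(0x03C2))); // final sigma

    // Code point 64279 from the ignored list: String uppercasing yields
    // two code points (1348 and 1341), both of which are inside the BMP.
    String ligature = str(64279);
    String upper = ligature.toUpperCase(Locale.ROOT);
    upper.codePoints().forEach(cp ->
        System.out.println(cp + " inBmp=" + Character.isBmpCodePoint(cp)));

    // Character-level uppercasing has no single-code-point mapping here,
    // so it returns the input unchanged.
    System.out.println(Character.toUpperCase(64279) == 64279);

    // Pattern compares code point by code point, so the one-code-point
    // ligature does not match its two-code-point uppercase form.
    System.out.println(patternMatches(ligature, upper));

    // The first map entry quoted in the next hunk: 181 (micro sign)
    // uppercases to 924, which in turn lowercases to 956.
    System.out.println(Character.toUpperCase(181) + " " + Character.toLowerCase(924));
  }
}
```

   The same check on code point 223 (ß), also in the ignored list, shows the one-to-many expansion with no BMP transition involved, which is consistent with dropping the BMP wording from the Javadoc.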
##########
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##########
@@ -436,6 +478,160 @@ public enum Kind {
    */
   @Deprecated public static final int DEPRECATED_COMPLEMENT = 0x10000;

+  /**
+   * See {@link #UNICODE_CASE_INSENSITIVE} for more details on the set of known unstable alternative
+   * casings
+   */
+  static final Map<Integer, int[]> unstableUnicodeCharacters =
+      Map.ofEntries(
+          // these are the set of characters whose casing matches across multiple characters
+          entry(181, new int[] {924, 956}),

Review Comment:
   will do

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.