rmuir commented on code in PR #14192:
URL: https://github.com/apache/lucene/pull/14192#discussion_r1939946041
##########
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##########
@@ -436,6 +478,160 @@ public enum Kind {
*/
@Deprecated public static final int DEPRECATED_COMPLEMENT = 0x10000;
+ /**
+ * See {@link #UNICODE_CASE_INSENSITIVE} for more details on the set of known unstable alternative
+ * casings
+ */
+ static final Map<Integer, int[]> unstableUnicodeCharacters =
+ Map.ofEntries(
+ // these are the set of characters whose casing matches across multiple characters
+ entry(181, new int[] {924, 956}),
Review Comment:
Please use hex for the Unicode code points everywhere.
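
As a rough illustration of what that could look like for the first entry in the hunk above (181, 924 and 956 are U+00B5, U+039C and U+03BC), here is a small sketch. The class name `HexCodePoints`, the `EXAMPLE` map and the `toUPlus` helper are hypothetical and exist only for this example; they are not part of the PR.

```java
import static java.util.Map.entry;

import java.util.Map;

final class HexCodePoints {
  // Illustrative copy of the first entry from the diff, rewritten in hex:
  // U+00B5 MICRO SIGN -> U+039C GREEK CAPITAL LETTER MU, U+03BC GREEK SMALL LETTER MU
  static final Map<Integer, int[]> EXAMPLE =
      Map.ofEntries(entry(0xB5, new int[] {0x39C, 0x3BC}));

  // Hypothetical helper: formats a code point in the usual U+XXXX notation.
  static String toUPlus(int codePoint) {
    return String.format("U+%04X", codePoint);
  }

  public static void main(String[] args) {
    for (Map.Entry<Integer, int[]> e : EXAMPLE.entrySet()) {
      StringBuilder sb = new StringBuilder(toUPlus(e.getKey())).append(" ->");
      for (int cp : e.getValue()) {
        // Character.getName returns the Unicode character name for the code point.
        sb.append(' ').append(toUPlus(cp)).append(" (").append(Character.getName(cp)).append(')');
      }
      System.out.println(sb);
    }
  }
}
```

With hex literals the values line up directly with the Unicode charts, so a reader can look them up without converting.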
##########
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##########
@@ -424,6 +426,46 @@ public enum Kind {
/** Allows case insensitive matching of ASCII characters. */
public static final int ASCII_CASE_INSENSITIVE = 0x0100;
+ /**
+ * Allows case-insensitive matching of most Unicode characters.
+ *
+ * <p>In general the attempt is to reach parity with {@link java.util.regex.Pattern}
+ * Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags when doing a case-insensitive match.
+ * This means characters like those representing the Greek symbol sigma (Σ, σ, ς) will all match
+ * one another
+ *
+ * <p>Some Unicode characters are difficult to correctly decode casing. In some cases Java's
+ * String class correctly handles decoding these but Java's {@link java.util.regex.Pattern} class
+ * does not. Again to keep parity and for performance reasons we are maintaining consistency with
+ * {@link java.util.regex.Pattern}. There are three known special classes of these characters we
+ * term unstable:
+ *
+ * <ul>
+ * <li>1. the set of characters whose casing matches across multiple characters such as the
+ * Greek sigma character mentioned above (Σ, σ, ς); we support these; notably some of these
+ * characters fall into the ASCII range and so will behave differently when this flag is
+ * enabled
+ * <li>2. the set of characters that are neither in an upper nor lower case stable state and can
+ * be both uppercased and lowercased from their current code point such as Dž which when
+ * uppercased produces DŽ and when lowercased produces dž; we support these
+ * <li>3. the set of characters that transition into or out of the Basic Multilingual Plane
+ * (BMP). For performance reasons we ignore characters that transition for now, which is
+ * consistent with {@link java.util.regex.Pattern}
+ * </ul>
+ *
+ * <p>Sometimes these classes of character will overlap; if a character is in both class 3 and any
+ * other case listed above it is ignored for the characters that transition; this is consistent
+ * with {@link java.util.regex.Pattern}. For instance: this character ῼ will match it's lowercase
+ * form ῳ in the BMP but not it's uppercase form outside the BMP: ΩΙ
+ *
+ * <p>These are the set of known characters that transition in or out of the BMP when cased such
+ * as ﬗ (code point: 64279 with 2 bytes) which when uppercased produces ՄԽ (code points: 1348 1341
+ * with 4 bytes) and are therefore ignored: 223, 304, 329, 496, 912, 944, 1415, 7830-7834, 8016,
+ * 8018, 8020, 8022, 8064-8111, 8114-8116, 8118, 8119, 8124, 8130-8132, 8134, 8135, 8140, 8146,
+ * 8147, 8150, 8151, 8162-8164, 8166-8180, 8182, 8183, 8188, 64256-64262, 64275-64279
+ */
Review Comment:
I don't understand this comment. Where is the "in or out of the BMP" transition?
ﬗ is U+FB17 `ARMENIAN SMALL LIGATURE MEN XEH`
ՄԽ is U+0544 `ARMENIAN CAPITAL LETTER MEN` followed by U+053D `ARMENIAN CAPITAL LETTER XEH`
All of these are in the BMP, so we need a better explanation of what this is about, since it isn't that.
Using hex would help when reading it, but even with the decimal numbers you can see that nothing in this comment is outside the BMP: every code point listed is below 64k (0x10000).
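
One way to check this claim is a small standalone sketch using only `java.lang.Character` and `String`. The class `BmpCheck` and its helpers are hypothetical names made up for this example; nothing here comes from the PR.

```java
import java.util.Locale;

final class BmpCheck {
  // True if every code point in s lies in the BMP (U+0000..U+FFFF).
  static boolean allInBmp(String s) {
    return s.codePoints().allMatch(Character::isBmpCodePoint);
  }

  // Formats each code point of s as U+XXXX, space separated.
  static String describe(String s) {
    StringBuilder sb = new StringBuilder();
    s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
    return sb.toString().trim();
  }

  public static void main(String[] args) {
    String ligature = "\uFB17"; // ARMENIAN SMALL LIGATURE MEN XEH
    String upper = ligature.toUpperCase(Locale.ROOT);
    System.out.println(describe(ligature) + " inBmp=" + allInBmp(ligature));
    System.out.println(describe(upper) + " inBmp=" + allInBmp(upper));
    // The largest value listed in the Javadoc is 64279; compare it with the
    // first supplementary (non-BMP) code point, 0x10000.
    System.out.println(64279 < Character.MIN_SUPPLEMENTARY_CODE_POINT);
  }
}
```

If the uppercase form comes back as two code points that are both in the BMP, then what changes is the number of code points, not the plane. That would be a different property from the one the Javadoc describes.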
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]