Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

via GitHub Mon, 03 Feb 2025 18:45:12 -0800


rmuir commented on code in PR #14192:
URL: https://github.com/apache/lucene/pull/14192#discussion_r1940403543



##########
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##########
@@ -436,6 +478,160 @@ public enum Kind {
    */
   @Deprecated public static final int DEPRECATED_COMPLEMENT = 0x10000;
 
+  /**
+   * See {@link #UNICODE_CASE_INSENSITIVE} for more details on the set of known unstable alternative
+   * casings
+   */
+  static final Map<Integer, int[]> unstableUnicodeCharacters =
+      Map.ofEntries(
+          // these are the set of characters whose casing matches across multiple characters
+          entry(181, new int[] {924, 956}),

Review Comment:
   Didn't mean to come off short here.
   
   Hex is just the de facto standard, e.g. you'll see that representation in all the Unicode data files. It helps to have one unambiguous way to refer to the same codepoint.
   
   There are quite a few Lucene code reviewers who can look at the hex digits and recognize things: maybe how many bytes it takes up in UTF-8, maybe whether it is a compatibility character that might explode (﷽), maybe what block it is in, and so on.
   
   Since they are all codepoints (`int`), it will help to write them all in hex like `0x1F602`, since that form can represent every codepoint. Only bother with `\uXXXX` when dealing with Java's UTF-16 `char` type.
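As a rough illustration of the point above, here is a minimal sketch that converts the decimal values from the quoted diff entry (181, 924, 956) to hex and reports the properties a reviewer might read off them. The class `HexCodepoints` and its helpers `toHex` and `utf8Length` are hypothetical names made up for this example; they are not part of the PR or of Lucene.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class for illustration only; not part of Lucene or the PR.
public class HexCodepoints {

  /** Formats a codepoint as a Java int literal in hex, padded to at least four digits. */
  static String toHex(int codePoint) {
    return String.format(Locale.ROOT, "0x%04X", codePoint);
  }

  /** Returns how many bytes the codepoint occupies when encoded as UTF-8. */
  static int utf8Length(int codePoint) {
    return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    // Decimal values copied from the quoted diff entry, plus the 0x1F602 example.
    int[] codePoints = {181, 924, 956, 0x1F602};
    for (int cp : codePoints) {
      System.out.println(
          cp + " = " + toHex(cp)
              + " (" + Character.getName(cp) + ")"
              + ", UTF-8 bytes: " + utf8Length(cp)
              + ", UTF-16 chars: " + Character.charCount(cp));
    }
    // The casing round trip that makes 0x00B5 "unstable": upper-casing it
    // yields 0x039C, and lower-casing that yields 0x03BC, not 0x00B5 again.
    int upper = Character.toUpperCase(0x00B5);
    int lower = Character.toLowerCase(upper);
    System.out.println(toHex(0x00B5) + " -> " + toHex(upper) + " -> " + toHex(lower));
  }
}
```

Under this sketch, the quoted entry would read `entry(0x00B5, new int[] {0x039C, 0x03BC})`, which lines up directly with the Unicode data files. The last codepoint also shows why `\uXXXX` falls short: `0x1F602` needs two UTF-16 `char` values, so it cannot be written as a single `\uXXXX` escape.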
    
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


