Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

via GitHub Tue, 04 Feb 2025 14:05:21 -0800


john-wagster commented on code in PR #14192:
URL: https://github.com/apache/lucene/pull/14192#discussion_r1941960613



##########
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##########
@@ -436,6 +478,160 @@ public enum Kind {
    */
   @Deprecated public static final int DEPRECATED_COMPLEMENT = 0x10000;
 
+  /**
+   * See {@link #UNICODE_CASE_INSENSITIVE} for more details on the set of 
known unstable alternative
+   * casings
+   */
+  static final Map<Integer, int[]> unstableUnicodeCharacters =
+      Map.ofEntries(
+          // these are the set of characters whose casing matches across 
multiple characters
+          entry(181, new int[] {924, 956}),
+          entry(924, new int[] {181, 956}),
+          entry(956, new int[] {181, 924}),
+          entry(304, new int[] {105, 73, 305}),
+          entry(105, new int[] {304, 73, 305}),
+          entry(73, new int[] {304, 105, 305}),
+          entry(305, new int[] {304, 105, 73}),
+          entry(383, new int[] {83, 115}),
+          entry(83, new int[] {383, 115}),
+          entry(115, new int[] {383, 83}),
+          entry(837, new int[] {921, 953, 8126}),
+          entry(921, new int[] {837, 953, 8126}),
+          entry(953, new int[] {837, 921, 8126}),
+          entry(8126, new int[] {837, 921, 953}),
+          entry(962, new int[] {931, 963}),
+          entry(931, new int[] {962, 963}),
+          entry(963, new int[] {962, 931}),
+          entry(976, new int[] {914, 946}),
+          entry(914, new int[] {976, 946}),
+          entry(946, new int[] {976, 914}),
+          entry(977, new int[] {920, 952, 1012}),
+          entry(920, new int[] {977, 952, 1012}),
+          entry(952, new int[] {977, 920, 1012}),
+          entry(1012, new int[] {977, 920, 952}),
+          entry(981, new int[] {934, 966}),
+          entry(934, new int[] {981, 966}),
+          entry(966, new int[] {981, 934}),
+          entry(982, new int[] {928, 960}),
+          entry(928, new int[] {982, 960}),
+          entry(960, new int[] {982, 928}),
+          entry(1008, new int[] {922, 954}),
+          entry(922, new int[] {1008, 954}),
+          entry(954, new int[] {1008, 922}),
+          entry(1009, new int[] {929, 961}),
+          entry(929, new int[] {1009, 961}),
+          entry(961, new int[] {1009, 929}),
+          entry(1013, new int[] {917, 949}),
+          entry(917, new int[] {1013, 949}),
+          entry(949, new int[] {1013, 917}),
+          entry(7296, new int[] {1042, 1074}),
+          entry(1042, new int[] {7296, 1074}),
+          entry(1074, new int[] {7296, 1042}),
+          entry(7297, new int[] {1044, 1076}),
+          entry(1044, new int[] {7297, 1076}),
+          entry(1076, new int[] {7297, 1044}),
+          entry(7298, new int[] {1054, 1086}),
+          entry(1054, new int[] {7298, 1086}),
+          entry(1086, new int[] {7298, 1054}),
+          entry(7299, new int[] {1057, 1089}),
+          entry(1057, new int[] {7299, 1089}),
+          entry(1089, new int[] {7299, 1057}),
+          entry(7300, new int[] {1058, 1090, 7301}),
+          entry(1058, new int[] {7300, 1090, 7301}),
+          entry(1090, new int[] {7300, 1058, 7301}),
+          entry(7301, new int[] {7300, 1058, 1090}),
+          entry(7302, new int[] {1066, 1098}),
+          entry(1066, new int[] {7302, 1098}),
+          entry(1098, new int[] {7302, 1066}),
+          entry(7303, new int[] {1122, 1123}),
+          entry(1122, new int[] {7303, 1123}),
+          entry(1123, new int[] {7303, 1122}),
+          entry(7304, new int[] {42570, 42571}),
+          entry(42570, new int[] {7304, 42571}),
+          entry(42571, new int[] {7304, 42570}),
+          entry(7835, new int[] {7776, 7777}),
+          entry(7776, new int[] {7835, 7777}),
+          entry(7777, new int[] {7835, 7776}),
+          entry(8486, new int[] {969, 937}),
+          entry(969, new int[] {8486, 937}),
+          entry(937, new int[] {8486, 969}),
+          entry(8490, new int[] {107, 75}),
+          entry(107, new int[] {8490, 75}),
+          entry(75, new int[] {8490, 107}),
+          entry(8491, new int[] {229, 197}),
+          entry(229, new int[] {8491, 197}),
+          entry(197, new int[] {8491, 229}),

Review Comment:
   I'm sold.  I was comparing the CaseFolding spec with the Pattern and String 
classes today and found a (albeit small) number of characters that neither 
correctly supports from the unicode spec which seemed to me that should pretty 
clearly be supported such as: 
   
   https://util.unicode.org/UnicodeJsps/character.jsp?a=1C8A&B1=Show
   https://util.unicode.org/UnicodeJsps/character.jsp?a=0x1C89&B1=Show
   
   I have some changes related to your other comments but I'm going to iterate 
a bit further and I'll likely need some additional feedback on an approach that 
incorporates that spec / we may need to talk about how to maintain the case 
folding as the unicode spec iterates.  I was hoping to rely heavily on the 
Pattern class being updated so we didn't have to do as much maintenance on our 
end to make sure case insensitivity continued to work appropriately.
   
   Also just to be clear the intention of the PR is just handling character 
casing not normalization (just forgot to address that yesterday). 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

Reply via email to