Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

via GitHub Sat, 05 Apr 2025 11:57:42 -0700


rmuir commented on code in PR #14350:
URL: https://github.com/apache/lucene/pull/14350#discussion_r2000536590



##########
lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java:
##########
@@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) {
 
     return alts;
   }
+
+  /**
+   * Folds the case of the given character according to {@link Character#toLowerCase(int)}, but with
+   * exceptions if the turkic flag is set.
+   *
+   * @param codepoint to code point for the character to fold
+   * @param turkic if true, then apply tr/az folding rules
+   * @return the folded character
+   */
+  static int foldCase(int codepoint, boolean turkic) {
+    if (turkic) {
+      if (codepoint == 0x00130) { // İ [LATIN CAPITAL LETTER I WITH DOT ABOVE]
+        return 0x00069; // i [LATIN SMALL LETTER I]
+      } else if (codepoint == 0x000049) { //  I [LATIN CAPITAL LETTER I]
+        return 0x00131; // ı [LATIN SMALL LETTER DOTLESS I]
+      }
+    }
+    return Character.toLowerCase(codepoint);

Review Comment:
   Real case folding has to do more than this. It is still a simple 1-1 
mapping, but e.g. `Σ`, `σ`, and `ς` all fold to `σ`, whereas 
`toLowerCase(ς)` = `ς`, because it is already lower-case, just in final form. 
This is only one example. To see more, compare your function against ICU's 
`UCharacter.foldCase(int, boolean)` across all of Unicode.
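
   A minimal JDK-only sketch of the sigma case follows. The class and method 
names (`SigmaFoldDemo`, `lowerOnly`, `upperThenLower`) are hypothetical and not 
part of Lucene. Uppercasing and then lowercasing is only a rough stand-in for 
simple case folding, used here to avoid an ICU dependency; it is not the 
Unicode CaseFolding.txt mapping, and the ICU comparison remains the real check.

```java
// Hypothetical demo class; the names here are illustrative, not Lucene API.
class SigmaFoldDemo {

  // Lowercasing only, as in the non-turkic branch of the patch under review.
  static int lowerOnly(int codepoint) {
    return Character.toLowerCase(codepoint);
  }

  // Assumption: upper-then-lower as a rough approximation of simple case
  // folding. It is NOT the Unicode CaseFolding.txt mapping; it only shows
  // why lowercasing alone leaves some equivalent characters distinct.
  static int upperThenLower(int codepoint) {
    return Character.toLowerCase(Character.toUpperCase(codepoint));
  }

  // Counts code points where the two approaches disagree.
  static int countDisagreements() {
    int count = 0;
    for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
      if (lowerOnly(cp) != upperThenLower(cp)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Σ (U+03A3), σ (U+03C3), ς (U+03C2, final form)
    int[] sigmas = {0x03A3, 0x03C3, 0x03C2};
    for (int cp : sigmas) {
      System.out.printf(
          "U+%04X lowerOnly=U+%04X upperThenLower=U+%04X%n",
          cp, lowerOnly(cp), upperThenLower(cp));
    }
    System.out.println("disagreements: " + countDisagreements());
  }
}
```

   With lowercasing alone, `ς` stays `ς`, so it would not match `σ` or `Σ`. 
The disagreement count gives a quick sense of how many other code points are 
affected, though the exact number depends on the JDK's Unicode version.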



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


