rmuir commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r2000536590
##########
lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java:
##########
@@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) {

     return alts;
   }
+
+  /**
+   * Folds the case of the given character according to {@link Character#toLowerCase(int)}, but with
+   * exceptions if the turkic flag is set.
+   *
+   * @param codepoint to code point for the character to fold
+   * @param turkic if true, then apply tr/az folding rules
+   * @return the folded character
+   */
+  static int foldCase(int codepoint, boolean turkic) {
+    if (turkic) {
+      if (codepoint == 0x00130) { // İ [LATIN CAPITAL LETTER I WITH DOT ABOVE]
+        return 0x00069; // i [LATIN SMALL LETTER I]
+      } else if (codepoint == 0x000049) { // I [LATIN CAPITAL LETTER I]
+        return 0x00131; // ı [LATIN SMALL LETTER DOTLESS I]
+      }
+    }
+    return Character.toLowerCase(codepoint);

Review Comment:
For real case folding we have to do more than this. Case folding is a simple 1-1 mapping, but it is not the same mapping as lowercasing: e.g. `Σ`, `σ`, and `ς` all fold to `σ`, whereas `toLowerCase(ς) = ς`, because `ς` is already lowercase, just in final form. This is only one example. To see more, compare your function against ICU's `UCharacter.foldCase(int, boolean)` across all of Unicode.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
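The Σ/σ/ς example from the review comment can be reproduced with the JDK alone. The sketch below is not part of the PR. The class name `FoldCaseGap` and the helpers `lowerCaseFold`, `roundTripFold`, and `countDisagreements` are hypothetical names introduced here for illustration. `roundTripFold` (uppercase, then lowercase) is only a rough stand-in for real case folding: it happens to agree with folding for the sigma family, but it is not the Unicode `CaseFolding.txt` mapping and gives different answers elsewhere (for example on dotless `ı`). The authoritative comparison remains the one the reviewer suggests, against ICU's `UCharacter.foldCase`.

```java
final class FoldCaseGap {

  // Same logic as the foldCase helper in the diff: lowercase, plus the tr/az exceptions.
  static int lowerCaseFold(int codepoint, boolean turkic) {
    if (turkic) {
      if (codepoint == 0x0130) { // LATIN CAPITAL LETTER I WITH DOT ABOVE
        return 0x0069; // LATIN SMALL LETTER I
      } else if (codepoint == 0x0049) { // LATIN CAPITAL LETTER I
        return 0x0131; // LATIN SMALL LETTER DOTLESS I
      }
    }
    return Character.toLowerCase(codepoint);
  }

  // Hypothetical stand-in for folding (an assumption, NOT CaseFolding.txt):
  // uppercase first, then lowercase, so lowercase variants collapse together.
  static int roundTripFold(int codepoint) {
    return Character.toLowerCase(Character.toUpperCase(codepoint));
  }

  // Counts code points where plain lowercasing and the round trip disagree.
  // A nonzero count shows that lowercasing alone does not merge all case variants.
  static int countDisagreements() {
    int count = 0;
    for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
      if (Character.toLowerCase(cp) != roundTripFold(cp)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    int[] sigmas = {0x03A3, 0x03C3, 0x03C2}; // capital sigma, small sigma, final sigma
    for (int cp : sigmas) {
      System.out.printf(
          "U+%04X lower=U+%04X roundTrip=U+%04X%n",
          cp, lowerCaseFold(cp, false), roundTripFold(cp));
    }
    System.out.println("disagreements: " + countDisagreements());
  }
}
```

For the three sigma forms, the `lower` column is U+03C3, U+03C3, U+03C2, while the `roundTrip` column is U+03C3 in every row. The final sigma is left unchanged by `Character.toLowerCase`, which is exactly the gap the review comment describes.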