rmuir commented on code in PR #14350:
URL: https://github.com/apache/lucene/pull/14350#discussion_r2002272398
##########
lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java:
##########
@@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) {
     return alts;
   }
+
+  /**
+   * Folds the case of the given character according to {@link Character#toLowerCase(int)}, but with
+   * exceptions if the turkic flag is set.
+   *
+   * @param codepoint to code point for the character to fold
+   * @param turkic if true, then apply tr/az folding rules
+   * @return the folded character
+   */
+  static int foldCase(int codepoint, boolean turkic) {
+    if (turkic) {
+      if (codepoint == 0x00130) { // İ [LATIN CAPITAL LETTER I WITH DOT ABOVE]
+        return 0x00069; // i [LATIN SMALL LETTER I]
+      } else if (codepoint == 0x000049) { // I [LATIN CAPITAL LETTER I]
+        return 0x00131; // ı [LATIN SMALL LETTER DOTLESS I]
+      }
+    }
+    return Character.toLowerCase(codepoint);

Review Comment:
   Maybe, depending on what we are going to do with it? If done correctly, we could replace `LowerCaseFilter`, `GreekLowerCaseFilter`, etc. in the analysis chain. Of course, "correctly" is a difficult bar there, as it would impact 100% of users in a very visible way and could easily bottleneck indexing or waste resources if not done correctly. For example, large arrays of objects, or even of primitives, are a big no here. See https://www.strchr.com/multi-stage_tables and look at what the JDK and ICU already do.

   But for the purpose of this PR, we may want to start simpler (this is the same approach I mentioned on the regex caseless PR). We should avoid huge arrays and large data files in lucene-core just to support more inefficient user regular expressions, which aren't really related to searching. On the other hand, if we are going to get a serious benefit everywhere (e.g. improve all analyzers), then maybe the tradeoff makes sense.

   And I don't understand why we'd parse text files versus just writing the generator itself to use ICU...
   especially since we already use such an approach in the build: https://github.com/apache/lucene/blob/main/gradle/generation/icu/GenerateUnicodeProps.groovy

   Still, I wouldn't immediately jump to generation as a start: it is a lot of work, and we should iterate. First I'd compare `Character.toLowerCase(Character.toUpperCase(x))` to `UCharacter.foldCase(int, false)` to see what the delta really needs to be as far as data. I'd expect this to be very small. You can start prototyping with that instead of investing a ton of up-front time.
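
   For readers unfamiliar with the multi-stage table idea linked above, here is a minimal sketch of a two-stage lookup. It is an illustration only, not code from this PR, the JDK, or ICU. The class name `TwoStageLowerCase`, the 256-entry block size, and the use of `Character.toLowerCase(int)` as stand-in data are all assumptions made for the example; a real case-folding table would be generated from ICU data at build time rather than computed in a static initializer.

```java
import java.nio.IntBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical two-stage table; names and block size are illustrative assumptions. */
final class TwoStageLowerCase {
  private static final int SHIFT = 8; // 256 code points per block
  private static final int BLOCK = 1 << SHIFT;
  private static final int MASK = BLOCK - 1;

  // Stage 1: one entry per block of code points, holding the index of a shared block.
  private static final short[] STAGE1;
  // Stage 2: the distinct blocks, flattened; each entry is (mapped - codepoint).
  private static final int[] STAGE2;

  static {
    int blocks = (Character.MAX_CODE_POINT + 1) >>> SHIFT;
    STAGE1 = new short[blocks];
    Map<IntBuffer, Integer> seen = new HashMap<>(); // block content -> block index
    List<int[]> unique = new ArrayList<>();
    for (int b = 0; b < blocks; b++) {
      int[] deltas = new int[BLOCK];
      for (int i = 0; i < BLOCK; i++) {
        int cp = (b << SHIFT) | i;
        // Stand-in data source; a real table would use generated folding data.
        deltas[i] = Character.toLowerCase(cp) - cp;
      }
      // IntBuffer compares by content, so identical blocks share one index.
      IntBuffer key = IntBuffer.wrap(deltas);
      Integer index = seen.get(key);
      if (index == null) {
        index = unique.size();
        unique.add(deltas);
        seen.put(key, index);
      }
      STAGE1[b] = (short) (int) index;
    }
    STAGE2 = new int[unique.size() * BLOCK];
    for (int u = 0; u < unique.size(); u++) {
      System.arraycopy(unique.get(u), 0, STAGE2, u * BLOCK, BLOCK);
    }
  }

  /** Two array reads and an add; code points outside the Unicode range map to themselves. */
  static int toLowerCase(int cp) {
    if (cp < 0 || cp > Character.MAX_CODE_POINT) {
      return cp;
    }
    int block = STAGE1[cp >>> SHIFT];
    return cp + STAGE2[(block << SHIFT) | (cp & MASK)];
  }

  /** Number of int entries kept in stage 2, to compare against a flat 0x110000-entry array. */
  static int stage2Length() {
    return STAGE2.length;
  }

  public static void main(String[] args) {
    System.out.println("stage1 entries: " + STAGE1.length);
    System.out.println("stage2 entries: " + stage2Length());
    System.out.println("flat entries:   " + (Character.MAX_CODE_POINT + 1));
  }
}
```

   The point of the layout is that most 256-code-point blocks contain no case mappings at all, so they collapse into a single shared all-zero block, and only the few blocks with real mappings cost memory. The same structure could hold just the delta between the JDK mapping and ICU's folding, if that delta turns out to be as small as expected.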