Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

via GitHub Tue, 18 Mar 2025 18:46:12 -0700


rmuir commented on code in PR #14350:
URL: https://github.com/apache/lucene/pull/14350#discussion_r2002272398



##########
lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java:
##########
@@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) {
 
     return alts;
   }
+
+  /**
+   * Folds the case of the given character according to {@link Character#toLowerCase(int)}, but with
+   * exceptions if the turkic flag is set.
+   *
+   * @param codepoint the code point for the character to fold
+   * @param turkic if true, then apply tr/az folding rules
+   * @return the folded character
+   */
+  static int foldCase(int codepoint, boolean turkic) {
+    if (turkic) {
+      if (codepoint == 0x00130) { // İ [LATIN CAPITAL LETTER I WITH DOT ABOVE]
+        return 0x00069; // i [LATIN SMALL LETTER I]
+      } else if (codepoint == 0x000049) { //  I [LATIN CAPITAL LETTER I]
+        return 0x00131; // ı [LATIN SMALL LETTER DOTLESS I]
+      }
+    }
+    return Character.toLowerCase(codepoint);

Review Comment:
   Maybe, depending on what we are going to do with it. If done correctly, we
could replace `LowerCaseFilter`, `GreekLowerCaseFilter`, etc. in the analysis
chain. Of course, "correctly" is a difficult bar there: it would impact 100% of
users in a very visible way and could easily bottleneck indexing or waste
resources if done badly. For example, large arrays of objects, or even of
primitives, are a big no here. See https://www.strchr.com/multi-stage_tables
and look at what the JDK and ICU already do.
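
   To make the multi-stage table idea concrete, here is a minimal sketch of a
two-stage lookup. Everything in it is an assumption for illustration: the class
name `TwoStageLowerCase`, the 128-code-point block size, and the use of
`Character.toLowerCase(int)` as the data source are not from the PR, and a real
implementation would generate the tables at build time rather than compute them
in a constructor. It only shows how identical blocks get shared so the tables
stay small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical two-stage table; not part of Lucene, the JDK, or ICU. */
final class TwoStageLowerCase {
  private static final int SHIFT = 7; // assumed block size: 128 code points
  private static final int BLOCK = 1 << SHIFT;
  private static final int MASK = BLOCK - 1;

  private final short[] stage1; // block number -> slot of a shared block in stage2
  private final int[] stage2; // unique blocks of (lowercase - codepoint) deltas

  TwoStageLowerCase() {
    int blockCount = (Character.MAX_CODE_POINT + 1) >> SHIFT;
    stage1 = new short[blockCount];
    Map<String, Integer> seen = new HashMap<>();
    List<int[]> unique = new ArrayList<>();
    for (int b = 0; b < blockCount; b++) {
      int[] deltas = new int[BLOCK];
      for (int i = 0; i < BLOCK; i++) {
        int cp = (b << SHIFT) | i;
        deltas[i] = Character.toLowerCase(cp) - cp;
      }
      // Identical blocks (most of them are all zeros) are stored only once.
      String key = Arrays.toString(deltas);
      Integer slot = seen.get(key);
      if (slot == null) {
        slot = unique.size();
        unique.add(deltas);
        seen.put(key, slot);
      }
      stage1[b] = (short) slot.intValue();
    }
    stage2 = new int[unique.size() * BLOCK];
    for (int u = 0; u < unique.size(); u++) {
      System.arraycopy(unique.get(u), 0, stage2, u << SHIFT, BLOCK);
    }
  }

  /** Two array reads and an add; no per-character objects. */
  int toLowerCase(int cp) {
    int slot = stage1[cp >> SHIFT];
    return cp + stage2[(slot << SHIFT) | (cp & MASK)];
  }

  /** Number of distinct blocks actually stored in stage2. */
  int uniqueBlocks() {
    return stage2.length >> SHIFT;
  }

  public static void main(String[] args) {
    TwoStageLowerCase table = new TwoStageLowerCase();
    System.out.println("unique blocks stored: " + table.uniqueBlocks());
    System.out.printf("U+%04X%n", table.toLowerCase(0x0130));
  }
}
```

   The stage-1 array has one `short` per block, and stage 2 only holds blocks
that actually differ, which is the property that keeps this approach away from
the large flat arrays warned about above.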
   
   But for the purpose of this PR, we may want to start simpler (this is the
same approach I mentioned on the regex caseless PR). We should avoid huge arrays
and large data files in lucene-core just to support more inefficient user
regular expressions, which aren't really related to searching. On the other
hand, if we are going to get a serious benefit everywhere (e.g. improve all
analyzers), then maybe the tradeoff makes sense.
   
   And I don't understand why we'd parse text files rather than just write the
generator itself to use ICU, especially since we already use such an approach
in the build:
https://github.com/apache/lucene/blob/main/gradle/generation/icu/GenerateUnicodeProps.groovy
   
   Still, I wouldn't immediately jump to generation as a start; it is a lot of
work, and we should iterate. First I'd compare
`Character.toLowerCase(Character.toUpperCase(x))` to
`UCharacter.foldCase(int, false)` to see what the delta really needs to be as
far as data goes. I'd expect this to be very small. You can start prototyping
with that instead of investing a ton of up-front time.
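
   A rough sketch of that comparison, under stated assumptions: the class name
`FoldDelta` and its methods are hypothetical, and the reference fold is passed
in as an `IntUnaryOperator` so the sketch compiles with only the JDK. With
ICU4J on the classpath, the reference would be something like
`cp -> com.ibm.icu.lang.UCharacter.foldCase(cp, false)`; check the ICU4J docs
for the exact meaning of the boolean flag before relying on it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

/** Hypothetical helper for measuring the fold delta; not part of the PR. */
final class FoldDelta {

  /** JDK-only approximation of case folding mentioned in the review. */
  static int jdkFold(int cp) {
    return Character.toLowerCase(Character.toUpperCase(cp));
  }

  /** Returns every code point where jdkFold disagrees with the reference fold. */
  static List<Integer> delta(IntUnaryOperator reference) {
    List<Integer> out = new ArrayList<>();
    for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
      if (jdkFold(cp) != reference.applyAsInt(cp)) {
        out.add(cp);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Placeholder reference so this runs without ICU4J; it yields an empty delta.
    // Assumed ICU4J replacement: cp -> com.ibm.icu.lang.UCharacter.foldCase(cp, false)
    IntUnaryOperator reference = FoldDelta::jdkFold;
    List<Integer> diffs = delta(reference);
    for (int cp : diffs) {
      System.out.printf(
          "U+%04X jdk=U+%04X ref=U+%04X%n", cp, jdkFold(cp), reference.applyAsInt(cp));
    }
    System.out.println(diffs.size() + " code points differ");
  }
}
```

   The size of the printed list is the amount of exception data a JDK-based
implementation would have to carry, which is the number worth knowing before
deciding whether generated tables are needed at all.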
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


