Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-04-05 Thread via GitHub
rmuir commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r2000536590 ## lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java: ## @@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) { return alts;

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-21 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2744342290 Maybe this one helps the issue: https://github.com/apache/lucene/pull/14389 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-18 Thread via GitHub
rmuir commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r2002272398 ## lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java: ## @@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) { return alts;

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-18 Thread via GitHub
msfroh commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r2002184400 ## lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java: ## @@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) { return alts;

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-18 Thread via GitHub
dweiss commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r2000465944 ## lucene/core/src/java/org/apache/lucene/util/automaton/Automata.java: ## @@ -608,7 +608,24 @@ public static Automaton makeStringUnion(Iterable utf8Strings) { if

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-17 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2731329809 It is confusing, because the Unicode case folding algorithm is supposed to work for everyone. But here's the problem: for most of the world: * lowercase i has a dot, uppercase I has n
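
The dotted/dotless i problem described in this comment can be reproduced with the JDK's locale-sensitive case mapping. The following is a minimal sketch, not code from the PR; the class name `TurkishIDemo` and its helpers are hypothetical, and only `String.toLowerCase(Locale)` and `String.toUpperCase(Locale)` are real JDK calls.

```java
import java.util.Locale;

// Hypothetical demo class (not part of Lucene or the PR).
final class TurkishIDemo {
  // Lowercase a string under the given locale's case-mapping rules.
  static String lower(String s, Locale locale) {
    return s.toLowerCase(locale);
  }

  // Uppercase a string under the given locale's case-mapping rules.
  static String upper(String s, Locale locale) {
    return s.toUpperCase(locale);
  }

  public static void main(String[] args) {
    Locale turkish = Locale.forLanguageTag("tr");
    // Default rules pair I with i; Turkish rules pair I with dotless ı (U+0131)
    // and i with dotted İ (U+0130).
    System.out.println(lower("I", Locale.ROOT));
    System.out.println(lower("I", turkish));
    System.out.println(upper("i", Locale.ROOT));
    System.out.println(upper("i", turkish));
  }
}
```

Under the Turkish locale, `I` and `i` are not case variants of each other, which is why a single default mapping can match unrelated characters for Turkish and Azeri text.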

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-17 Thread via GitHub
msfroh commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730578138 Okay, got it! That's the piece that I was misunderstanding. I didn't realize that Turkish/Azeri is the **only** other valid folding. I kept thinking of it as just an example where the naï

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-17 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730410784 If you want to do fancy Romanian accent removal, use an analyzer and normalize your data. That's what a search engine is all about. But if we want to provide some limited runtime ex

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-17 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730397885 I think my ask is misunderstood: it is just to follow the Unicode standard. There are two mappings for simple case folding: * Default * Alternate (Turkish/Azeri)

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-17 Thread via GitHub
msfroh commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730390727 Instead of a boolean flag, what if we define an interface that specifies the folding rules? It could have two methods: one that folds input characters to a canonical representatio

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-15 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2726990337 Separately, it would be nice to add a boolean flag (for Turkish/Azeri) to that CaseFolding class, and fix it to do the right thing, so it doesn't match unrelated characters in Turkish. ultim
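
One way to picture the boolean flag suggested here is a fold function that switches between the default mapping and the Turkish/Azeri alternate. This is only a sketch under assumptions: `SimpleFoldSketch` and `fold` are hypothetical names, not the `CaseFolding` API, and `Character.toLowerCase` stands in for default simple case folding, which it only approximates.

```java
// Hypothetical sketch; not the Lucene CaseFolding class.
final class SimpleFoldSketch {
  /**
   * Folds a code point. With turkic == true, the two Turkish/Azeri mappings
   * from Unicode CaseFolding.txt (status T) take precedence:
   * U+0049 (I) -> U+0131 (ı) and U+0130 (İ) -> U+0069 (i).
   */
  static int fold(int codePoint, boolean turkic) {
    if (turkic) {
      if (codePoint == 'I') {
        return 0x0131; // capital I folds to dotless ı
      }
      if (codePoint == 0x0130) {
        return 'i'; // capital dotted İ folds to plain i
      }
    }
    // Assumption: toLowerCase approximates default simple case folding.
    return Character.toLowerCase(codePoint);
  }

  public static void main(String[] args) {
    System.out.println(Integer.toHexString(fold('I', false)));
    System.out.println(Integer.toHexString(fold('I', true)));
  }
}
```

With the flag set, `I` and `i` fold to different code points, so a matcher built on this function would no longer treat them as equivalent in Turkish mode.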

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-15 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2726984173 +1 to start simple with Character.toLowerCase, that's the best you can get in Java. The problem is Java not having a Character.foldCase. A proper function would look like ICU's `UCha

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
msfroh commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2726097192 Hmm... I'm thinking of just requiring that input is lowercase (per `Character.toLowerCase(c)`), then check for collisions on uppercase versions when adding transitions, and throw an excepti
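
The "require lowercase input" half of this idea could look roughly like the check below. It is a sketch only: `LowercaseInputCheck` and `requireLowercase` are hypothetical names, and the comparison against `Character.toLowerCase` is an assumed reading of the proposal, not code from the PR.

```java
// Hypothetical validation helper; not part of Lucene.
final class LowercaseInputCheck {
  /** Throws if any code point in the term changes under Character.toLowerCase. */
  static void requireLowercase(String term) {
    for (int i = 0; i < term.length(); ) {
      int cp = term.codePointAt(i);
      if (cp != Character.toLowerCase(cp)) {
        throw new IllegalArgumentException(
            String.format("Term contains non-lowercase code point U+%04X", cp));
      }
      i += Character.charCount(cp); // advance past surrogate pairs correctly
    }
  }

  public static void main(String[] args) {
    requireLowercase("abc"); // passes silently
    try {
      requireLowercase("aBc");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Comparing each code point with its own lowercase form accepts digits and punctuation, which `Character.isLowerCase()` alone would reject.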

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2725736846 It isn't a good idea. If the user wants to "erase case differences" then they should apply `foldcase(ch)`. That's what case-folding means. That CaseFolding class does everything, except, t

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
msfroh commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2725709282 This is kind of what I had in mind: ```java private static int canonicalize(int codePoint) { int[] alternatives = CaseFolding.lookupAlternates(codePoint); if (alte

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724580292 This is why I recommended not using the Unicode function and starting simple. Then you have a potential way to get it working efficiently.

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
dweiss commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2725496380 Ok, fair enough.

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724585337 > Or we can just embrace the fact that it can be a non-minimal NFA and just let it run like that (with NFARunAutomaton). I don't think this is currently a good option either: users wo

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
dweiss commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724314718 Or we can just embrace the fact that it can be a non-minimal NFA and just let it run like that (with NFARunAutomaton).

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-14 Thread via GitHub
dweiss commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724062564 I don't know Unicode as well as Rob so I can't say what these alternate case folding equivalence classes are... but they definitely don't have a "canonical" representation with rega

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-13 Thread via GitHub
msfroh commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2722961519 To the best of my understanding from reading through the code while sketching this PR, I believe it would produce a minimal DFA if every character in a set of alternatives in the inpu

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-13 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2722901888 > Would a check with `Character.isLowerCase()` on each input codepoint for the case-insensitive case be sufficient to reject that kind of input across all valid Unicode strings? I d

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-13 Thread via GitHub
msfroh commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2722105723 My thinking is that a query that uses this should lowercase, dedupe, and sort the input before feeding it into `StringsToAutomaton`. That would handle @dweiss's example (i.e. that input i
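
The lowercase, dedupe, and sort step mentioned in this comment can be sketched as a small preprocessing helper. The names `TermPreprocessing` and `prepare` are hypothetical, and sorting by unsigned UTF-8 byte order is an assumption about what the automaton builder expects; nothing here calls Lucene.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical preprocessing helper; not part of Lucene.
final class TermPreprocessing {
  /** Lowercases each term, removes duplicates, and sorts by unsigned UTF-8 byte order. */
  static List<String> prepare(Collection<String> terms) {
    // Assumption: the consumer wants UTF-8 byte order, which differs from
    // String.compareTo for supplementary characters.
    Comparator<String> utf8Order =
        (a, b) ->
            Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8), b.getBytes(StandardCharsets.UTF_8));
    TreeSet<String> sorted = new TreeSet<>(utf8Order); // the set also dedupes
    for (String term : terms) {
      sorted.add(term.toLowerCase(Locale.ROOT));
    }
    return new ArrayList<>(sorted);
  }

  public static void main(String[] args) {
    System.out.println(prepare(List.of("Foo", "foo", "BAR", "baz", "bar")));
  }
}
```

Lowercasing before deduplication matters: `Foo` and `foo` collapse into one entry, so the builder never sees two inputs that differ only by case.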

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-13 Thread via GitHub
dweiss commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2720601955 I think this will work just fine in most cases and is a rather inexpensive way to implement this case-insensitive matching, but this comes at the cost of the output automaton that may not

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-13 Thread via GitHub
dweiss commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2720681661 Crap, you're right. Didn't think of it.

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-13 Thread via GitHub
dweiss commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2720479164 > @dweiss understands this one the best, he implemented it. ... 15 years ago in LUCENE-3832. Thanks for putting so much trust in my memory. I'll take a look.

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-13 Thread via GitHub
dweiss commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2720790704 I also don't think you can make it deterministic in any trivial way.

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-13 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2720652612 Bigger downside: that example isn't deterministic either.

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-12 Thread via GitHub
msfroh commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r1992783083 ## lucene/core/src/java/org/apache/lucene/util/automaton/StringsToAutomaton.java: ## @@ -209,7 +209,25 @@ private static int convert( int i = 0; int[] labels

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-12 Thread via GitHub
rmuir commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r1992524713 ## lucene/core/src/java/org/apache/lucene/util/automaton/StringsToAutomaton.java: ## @@ -209,7 +209,25 @@ private static int convert( int i = 0; int[] labels =

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-12 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2719555494 OG paper: https://aclanthology.org/J00-1002.pdf

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-12 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2719541789 To me it seems a potentially safe and practical addition. The idea would be that we can add transition "alternatives" (e.g. `A` vs `a`) and it doesn't break the high-level algorithm, due to
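
The notion of transition "alternatives" can be illustrated with a plain trie in which each edge is added under both its lowercase and uppercase label, pointing at the same child. This is a toy sketch under stated assumptions, not the incremental minimal-automaton construction that `StringsToAutomaton` implements; `CaseAlternativeTrie` and its methods are hypothetical, input is assumed to be already lowercase, and `Character.toUpperCase` is a stand-in for a real alternates lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy trie; not the Lucene automaton builder.
final class CaseAlternativeTrie {
  private static final class Node {
    final Map<Integer, Node> next = new HashMap<>();
    boolean accept;
  }

  private final Node root = new Node();

  /** Adds a term (assumed lowercase), labeling each edge with both case variants. */
  void add(String term) {
    Node node = root;
    for (int i = 0; i < term.length(); ) {
      int cp = term.codePointAt(i);
      int lower = Character.toLowerCase(cp);
      int upper = Character.toUpperCase(cp);
      Node child = node.next.computeIfAbsent(lower, k -> new Node());
      // The alternative label shares the same target state.
      Node previous = node.next.putIfAbsent(upper, child);
      if (previous != null && previous != child) {
        // Two different lowercase labels share an uppercase form.
        throw new IllegalStateException("Colliding case alternatives at U+"
            + Integer.toHexString(upper));
      }
      node = child;
      i += Character.charCount(cp);
    }
    node.accept = true;
  }

  /** Returns true if the input reaches an accepting state. */
  boolean matches(String input) {
    Node node = root;
    for (int i = 0; i < input.length(); ) {
      int cp = input.codePointAt(i);
      node = node.next.get(cp);
      if (node == null) {
        return false;
      }
      i += Character.charCount(cp);
    }
    return node.accept;
  }

  public static void main(String[] args) {
    CaseAlternativeTrie trie = new CaseAlternativeTrie();
    trie.add("abc");
    System.out.println(trie.matches("aBC"));
  }
}
```

Because both labels lead to the same child, the structure stays deterministic as long as no two distinct lowercase labels share an uppercase form; the exception marks the collision case raised elsewhere in the thread.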

Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]

2025-03-12 Thread via GitHub
rmuir commented on PR #14350: URL: https://github.com/apache/lucene/pull/14350#issuecomment-2719524190 @dweiss understands this one the best, he implemented it.