rmuir commented on code in PR #14350:
URL: https://github.com/apache/lucene/pull/14350#discussion_r2000536590
##
lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java:
##
@@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) {
return alts;
rmuir commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2744342290
Maybe this one helps the issue: https://github.com/apache/lucene/pull/14389
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub
rmuir commented on code in PR #14350:
URL: https://github.com/apache/lucene/pull/14350#discussion_r2002272398
##
lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java:
##
@@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) {
return alts;
msfroh commented on code in PR #14350:
URL: https://github.com/apache/lucene/pull/14350#discussion_r2002184400
##
lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java:
##
@@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) {
return alts;
dweiss commented on code in PR #14350:
URL: https://github.com/apache/lucene/pull/14350#discussion_r2000465944
##
lucene/core/src/java/org/apache/lucene/util/automaton/Automata.java:
##
@@ -608,7 +608,24 @@ public static Automaton makeStringUnion(Iterable<BytesRef> utf8Strings) {
if
rmuir commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2731329809
It is confusing, because the Unicode case folding algorithm is supposed to work
for everyone. But here's the problem:
for most of the world:
* lowercase i has a dot, uppercase I has n
msfroh commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730578138
Okay, got it! That's the piece that I was misunderstanding. I didn't realize
that Turkish/Azeri is the **only** other valid folding. I kept thinking of it
as just an example where the naï
rmuir commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730410784
If you want to do fancy Romanian accent removal, use an analyzer and
normalize your data. That's what a search engine is all about.
But if we want to provide some limited runtime ex
rmuir commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730397885
I think my ask is misunderstood: it is just to follow the Unicode standard.
There are two mappings for simple case folding:
* Default
* Alternate (Turkish/Azeri)
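To make the two mappings concrete, here is a minimal sketch (not the PR's code). The class `TwoFoldings` and the method `fold(int, boolean)` are hypothetical names. The default branch uses `Character.toLowerCase` as a stand-in for the Unicode default table, which is an assumption rather than an exact equivalent; the Turkic branch hard-codes the two `T` entries from Unicode's CaseFolding.txt (U+0049 to U+0131 and U+0130 to U+0069).

```java
final class TwoFoldings {
  private TwoFoldings() {}

  /**
   * Hypothetical helper: folds one code point under either mapping.
   * turkic = false: JDK lowercasing, used here as an approximation of the default mapping.
   * turkic = true:  the two Turkish/Azeri overrides are applied first.
   */
  static int fold(int codePoint, boolean turkic) {
    if (turkic) {
      if (codePoint == 'I') {
        return 0x0131; // LATIN SMALL LETTER DOTLESS I
      }
      if (codePoint == 0x0130) { // LATIN CAPITAL LETTER I WITH DOT ABOVE
        return 'i';
      }
    }
    return Character.toLowerCase(codePoint);
  }

  public static void main(String[] args) {
    System.out.printf("default I      -> U+%04X%n", fold('I', false));
    System.out.printf("turkic  I      -> U+%04X%n", fold('I', true));
    System.out.printf("turkic  U+0130 -> U+%04X%n", fold(0x0130, true));
  }
}
```

The point of the flag is that the same input `I` lands in a different equivalence class depending on the mapping, so a single table cannot serve both.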
msfroh commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2730390727
Instead of a boolean flag, what if we define an interface that specifies the
folding rules?
It could have two methods: one that folds input characters to a canonical
representatio
rmuir commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2726990337
Separately, it would be nice to add a boolean flag (for Turkish/Azeri) to that
CaseFolding class, and fix it to do the right thing, so it doesn't match
unrelated characters in Turkish. ultim
rmuir commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2726984173
+1 to start simple with `Character.toLowerCase`; that's the best you can get in
Java.
The problem is Java not having a `Character.foldCase`. A proper function would
look like ICU's `UCha
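A small sketch of what the missing fold function means in practice, using only JDK calls. `JdkFoldApprox` and `approxFold` are hypothetical names, and the upper-then-lower round trip is a common JDK-only approximation, not ICU's implementation and not the Unicode CaseFolding.txt tables.

```java
final class JdkFoldApprox {
  private JdkFoldApprox() {}

  /** Plain lowercasing: characters that are already lowercase are left alone. */
  static int lower(int codePoint) {
    return Character.toLowerCase(codePoint);
  }

  /**
   * Approximate fold (an assumption, not a standard API): uppercase first, then
   * lowercase, so lowercase variants that share an uppercase form collapse together.
   */
  static int approxFold(int codePoint) {
    return Character.toLowerCase(Character.toUpperCase(codePoint));
  }

  public static void main(String[] args) {
    int longS = 0x017F;      // LATIN SMALL LETTER LONG S
    int finalSigma = 0x03C2; // GREEK SMALL LETTER FINAL SIGMA
    System.out.printf("lower(U+017F)      = U+%04X%n", lower(longS));
    System.out.printf("approxFold(U+017F) = U+%04X%n", approxFold(longS));
    System.out.printf("approxFold(U+03C2) = U+%04X%n", approxFold(finalSigma));
  }
}
```

Lowercasing alone keeps long s and final sigma distinct from `s` and `σ`, which is the gap a real fold function closes.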
msfroh commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2726097192
Hmm... I'm thinking of just requiring that input is lowercase (per
`Character.toLowerCase(c)`), then check for collisions on uppercase versions when
adding transitions, and throw an excepti
rmuir commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2725736846
It isn't a good idea. If the user wants to "erase case differences" then
they should apply `foldcase(ch)`. That's what case-folding means. That
CaseFolding class does everything, except, t
msfroh commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2725709282
This is kind of what I had in mind:
```java
private static int canonicalize(int codePoint) {
  int[] alternatives = CaseFolding.lookupAlternates(codePoint);
  if (alte
```
rmuir commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724580292
This is why I recommended not using the Unicode function and starting
simple. Then you have a potential way to get it working efficiently.
dweiss commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2725496380
Ok, fair enough.
rmuir commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724585337
> Or we can just embrace the fact that it can be a non-minimal NFA and
just let it run like that (with NFARunAutomaton).
I don't think this is currently a good option either: users wo
dweiss commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724314718
Or we can just embrace the fact that it can be a non-minimal NFA and just let
it run like that (with NFARunAutomaton).
dweiss commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2724062564
I don't know Unicode as well as Rob so I can't say what these alternate case
folding equivalence classes are... but they definitely don't have a "canonical"
representation with rega
msfroh commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2722961519
To the best of my understanding from reading through the code while
sketching this PR, I believe it would produce a minimal DFA if every character
in a set of alternatives in the inpu
rmuir commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2722901888
> Would a check with `Character.isLowerCase()` on each input codepoint for
the case-insensitive case be sufficient to reject that kind of input across all
valid Unicode strings?
I d
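One detail worth keeping in mind when weighing that question, shown as a hedged sketch rather than as the reviewer's answer: `Character.isLowerCase()` returns false for code points that have no case at all, such as digits, so using it as a gate rejects more than uppercase input. The class and method names below are hypothetical.

```java
final class LowercaseCheck {
  private LowercaseCheck() {}

  /** True only if every code point is classified as a lowercase letter. */
  static boolean allIsLowerCase(String s) {
    return s.codePoints().allMatch(Character::isLowerCase);
  }

  /** True if lowercasing each code point would leave the string unchanged. */
  static boolean unchangedByLowercasing(String s) {
    return s.codePoints().allMatch(cp -> Character.toLowerCase(cp) == cp);
  }

  public static void main(String[] args) {
    System.out.println(allIsLowerCase("abc1"));         // digit is not a lowercase letter
    System.out.println(unchangedByLowercasing("abc1")); // digit is unaffected by lowercasing
    System.out.println(unchangedByLowercasing("Abc"));  // 'A' would change
  }
}
```

The second check accepts caseless characters while still rejecting anything `Character.toLowerCase` would alter.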
msfroh commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2722105723
My thinking is that a query that uses this should lowercase, dedupe, and
sort the input before feeding it into `StringsToAutomaton`. That would handle
@dweiss's example (i.e. that input i
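The lowercase, dedupe, and sort step could look roughly like the sketch below. It uses only the JDK; `PrepareTerms` and its methods are hypothetical names, and ordering by unsigned UTF-8 bytes is an assumption about what the automaton builder expects as sorted input.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class PrepareTerms {
  private PrepareTerms() {}

  /** Orders strings by their UTF-8 encoding, comparing bytes as unsigned values. */
  private static final Comparator<String> UTF8_ORDER =
      (a, b) ->
          Arrays.compareUnsigned(
              a.getBytes(StandardCharsets.UTF_8), b.getBytes(StandardCharsets.UTF_8));

  /** Lowercases code point by code point, without locale-sensitive rules. */
  static String lowerCaseCodePoints(String s) {
    return s.codePoints()
        .map(Character::toLowerCase)
        .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
        .toString();
  }

  /** Returns the lowercased terms, deduplicated and sorted in UTF-8 byte order. */
  static List<String> prepare(Collection<String> terms) {
    TreeSet<String> sorted = new TreeSet<>(UTF8_ORDER);
    for (String term : terms) {
      sorted.add(lowerCaseCodePoints(term));
    }
    return new ArrayList<>(sorted);
  }

  public static void main(String[] args) {
    System.out.println(prepare(List.of("b", "A", "a", "B")));
  }
}
```

Deduplicating after lowercasing matters here: `A` and `a` collapse to one entry before the builder ever sees them.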
dweiss commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2720601955
I think this will work just fine in most cases and is a rather inexpensive
way to implement this case-insensitive matching, but this comes at the cost of
the output automaton that may not
dweiss commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2720681661
Crap, you're right. Didn't think of it.
dweiss commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2720479164
> @dweiss understands this one the best, he implemented it.
... 15 years ago in LUCENE-3832. Thanks for putting so much trust in my
memory. I'll take a look.
dweiss commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2720790704
I also don't think you can make it deterministic in any trivial way.
rmuir commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2720652612
Bigger downside: that example isn't deterministic either.
msfroh commented on code in PR #14350:
URL: https://github.com/apache/lucene/pull/14350#discussion_r1992783083
##
lucene/core/src/java/org/apache/lucene/util/automaton/StringsToAutomaton.java:
##
@@ -209,7 +209,25 @@ private static int convert(
int i = 0;
int[] labels
rmuir commented on code in PR #14350:
URL: https://github.com/apache/lucene/pull/14350#discussion_r1992524713
##
lucene/core/src/java/org/apache/lucene/util/automaton/StringsToAutomaton.java:
##
@@ -209,7 +209,25 @@ private static int convert(
int i = 0;
int[] labels =
rmuir commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2719555494
OG paper: https://aclanthology.org/J00-1002.pdf
rmuir commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2719541789
To me it seems a potentially safe and practical addition. The idea would be
that we can add transition "alternatives" (e.g. `A` vs `a`) and it doesn't
break the high-level algorithm, due to
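The idea of transition alternatives can be illustrated with a toy structure. This is a plain trie, not the minimal-automaton construction discussed in this thread, and it is only a sketch under stated assumptions: `CaseInsensitiveTrie` is a hypothetical class, input is assumed to be lowercased already, and `Character.toUpperCase` stands in for whatever alternates table the real code would consult. Each transition on a lowercase label gets a second label for its uppercase form pointing at the same target state, and a collision between alternatives is rejected.

```java
import java.util.HashMap;
import java.util.Map;

final class CaseInsensitiveTrie {
  private static final class Node {
    final Map<Integer, Node> next = new HashMap<>();
    boolean accept;
  }

  private final Node root = new Node();

  /** Adds a term that is assumed to be lowercase already. */
  void add(String lowercased) {
    Node node = root;
    for (int i = 0; i < lowercased.length(); ) {
      int cp = lowercased.codePointAt(i);
      i += Character.charCount(cp);
      Node child = node.next.get(cp);
      if (child == null) {
        child = new Node();
        node.next.put(cp, child);
      }
      // Alternative label: same target state, so the structure stays deterministic.
      int upper = Character.toUpperCase(cp);
      if (upper != cp) {
        Node existing = node.next.putIfAbsent(upper, child);
        if (existing != null && existing != child) {
          throw new IllegalArgumentException(
              "Alternative label collides with an existing transition: U+"
                  + Integer.toHexString(upper));
        }
      }
      node = child;
    }
    node.accept = true;
  }

  /** Returns true if the input follows transitions to an accepting state. */
  boolean matches(String input) {
    Node node = root;
    for (int i = 0; i < input.length(); ) {
      int cp = input.codePointAt(i);
      i += Character.charCount(cp);
      node = node.next.get(cp);
      if (node == null) {
        return false;
      }
    }
    return node.accept;
  }
}
```

For example, after `add("foo")` the trie accepts `foo`, `FOO`, and `fOo`, while adding both `A` and `a` as separate terms triggers the collision check.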
rmuir commented on PR #14350:
URL: https://github.com/apache/lucene/pull/14350#issuecomment-2719524190
@dweiss understands this one the best, he implemented it.