rmuir commented on code in PR #14192:
URL: https://github.com/apache/lucene/pull/14192#discussion_r1940371117
##########
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##########
@@ -696,17 +896,52 @@ private Automaton toAutomaton(
     return a;
   }
-  private Automaton toCaseInsensitiveChar(int codepoint) {
-    Automaton case1 = Automata.makeChar(codepoint);
-    // For now we only work with ASCII characters
-    if (codepoint > 128) {
-      return case1;
+  /**
+   * This function handles both ASCII and the remainder of the Unicode spec for generating
+   * case-insensitive alternatives. Specifically for Unicode some special handling is required
+   * particularly to ensure behavior is at parity with other regex engines like {@link
+   * java.util.regex.Pattern}
+   *
+   * <p>See the {@link #UNICODE_CASE_INSENSITIVE} flag for details on the set of known unstable
+   * alternative casings within the Unicode spec.
+   *
+   * @param codepoint the Character code point to encode as an Automaton
+   * @return the set of Automaton that represent the original code point and it's variants
+   */
+  private Automaton toCaseInsensitiveChar(
+      int codepoint, boolean isAsciiInsensitive, boolean isUnicodeInsensitive) {
+    Automaton result;
+    if (isAsciiInsensitive || isUnicodeInsensitive) {
+      if (isUnicodeInsensitive) {
+        int[] altCodepoints = unstableUnicodeCharacters.get(codepoint);
+        if (altCodepoints != null) {
+          List<Automaton> altAutomaton = new ArrayList<>(altCodepoints.length);
+          for (int i = 0; i < altCodepoints.length; i++) {
+            altAutomaton.add(Automata.makeChar(altCodepoints[i]));
+          }
+          altAutomaton.add(Automata.makeChar(codepoint));
+          result = Operations.union(altAutomaton);

Review Comment:
   I opened #14193 to try to improve this case. It is hard to reason about the worst cases today. From what I can see, both the current ASCII-only code and this patch could just use an `int[]` and get back a minimal automaton.
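To make the `int[]` idea concrete, here is a minimal, self-contained sketch. It is not the change proposed in #14193 and does not use Lucene's API; the class `CodePointSetSketch` and its methods are hypothetical names invented for illustration. The underlying observation is that an automaton accepting a non-empty set of single code points needs only two states (start and accept), with one transition per contiguous range of code points, so sorting the array and merging adjacent values would give the minimal form directly, without a general-purpose union.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; this is not Lucene code. */
final class CodePointSetSketch {

  /**
   * Sorts the code points, drops duplicates, and merges neighbours into inclusive
   * {min, max} ranges. Each range stands for one transition from the start state
   * to the single accept state of a two-state automaton.
   */
  static List<int[]> toRanges(int[] codePoints) {
    int[] sorted = codePoints.clone();
    Arrays.sort(sorted);
    List<int[]> ranges = new ArrayList<>();
    int i = 0;
    while (i < sorted.length) {
      int min = sorted[i];
      int max = min;
      // Extend the range while the next value is a duplicate or the direct successor.
      while (i + 1 < sorted.length && sorted[i + 1] <= max + 1) {
        max = sorted[++i];
      }
      ranges.add(new int[] {min, max});
      i++;
    }
    return ranges;
  }

  /** Returns true if the code point is covered by one of the ranges. */
  static boolean accepts(List<int[]> ranges, int codePoint) {
    for (int[] range : ranges) {
      if (codePoint >= range[0] && codePoint <= range[1]) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Illustrative input: 'k', 'K', and U+212A KELVIN SIGN.
    int[] alternatives = {'k', 'K', 0x212A};
    for (int[] range : toRanges(alternatives)) {
      System.out.printf("U+%04X..U+%04X%n", range[0], range[1]);
    }
  }
}
```

The number of transitions here is bounded by the number of distinct code points in the array, which is one way the worst case becomes easier to reason about than with a union over per-character automata.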
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org