john-wagster commented on code in PR #14192:
URL: https://github.com/apache/lucene/pull/14192#discussion_r1941963110

##########
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##########
@@ -696,17 +896,52 @@ private Automaton toAutomaton(
     return a;
   }
 
-  private Automaton toCaseInsensitiveChar(int codepoint) {
-    Automaton case1 = Automata.makeChar(codepoint);
-    // For now we only work with ASCII characters
-    if (codepoint > 128) {
-      return case1;
+  /**
+   * This function handles both ASCII and the remainder of the Unicode spec for generating
+   * case-insensitive alternatives. Specifically for Unicode some special handling is required
+   * particularly to ensure behavior is at parity with other regex engines like {@link
+   * java.util.regex.Pattern}
+   *
+   * <p>See the {@link #UNICODE_CASE_INSENSITIVE} flag for details on the set of known unstable
+   * alternative casings within the Unicode spec.
+   *
+   * @param codepoint the Character code point to encode as an Automaton
+   * @return the set of Automaton that represent the original code point and it's variants
+   */
+  private Automaton toCaseInsensitiveChar(
+      int codepoint, boolean isAsciiInsensitive, boolean isUnicodeInsensitive) {
+    Automaton result;
+    if (isAsciiInsensitive || isUnicodeInsensitive) {
+      if (isUnicodeInsensitive) {
+        int[] altCodepoints = unstableUnicodeCharacters.get(codepoint);
+        if (altCodepoints != null) {
+          List<Automaton> altAutomaton = new ArrayList<>(altCodepoints.length);
+          for (int i = 0; i < altCodepoints.length; i++) {
+            altAutomaton.add(Automata.makeChar(altCodepoints[i]));
+          }
+          altAutomaton.add(Automata.makeChar(codepoint));
+          result = Operations.union(altAutomaton);

Review Comment:
   awesome; I took a brief look and it looked good to me. Part of the reason I was building these up in a List was related to a PR that @mayya-sharipova submitted recently where she found performance issues with unions: https://github.com/apache/lucene/pull/14169

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org