rmuir commented on code in PR #14192:
URL: https://github.com/apache/lucene/pull/14192#discussion_r1942013537
##########
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java:
##########
@@ -696,17 +896,52 @@ private Automaton toAutomaton(
return a;
}
- private Automaton toCaseInsensitiveChar(int codepoint) {
- Automaton case1 = Automata.makeChar(codepoint);
- // For now we only work with ASCII characters
- if (codepoint > 128) {
- return case1;
+ /**
+ * This function handles both ASCII and the remainder of the Unicode spec for generating
+ * case-insensitive alternatives. Specifically for Unicode some special handling is required
+ * particularly to ensure behavior is at parity with other regex engines like {@link
+ * java.util.regex.Pattern}
+ *
+ * <p>See the {@link #UNICODE_CASE_INSENSITIVE} flag for details on the set of known unstable
+ * alternative casings within the Unicode spec.
+ *
+ * @param codepoint the Character code point to encode as an Automaton
+ * @return the set of Automaton that represent the original code point and its variants
+ */
+ private Automaton toCaseInsensitiveChar(
+ int codepoint, boolean isAsciiInsensitive, boolean isUnicodeInsensitive) {
+ Automaton result;
+ if (isAsciiInsensitive || isUnicodeInsensitive) {
+ if (isUnicodeInsensitive) {
+ int[] altCodepoints = unstableUnicodeCharacters.get(codepoint);
+ if (altCodepoints != null) {
+ List<Automaton> altAutomaton = new ArrayList<>(altCodepoints.length);
+ for (int i = 0; i < altCodepoints.length; i++) {
+ altAutomaton.add(Automata.makeChar(altCodepoints[i]));
+ }
+ altAutomaton.add(Automata.makeChar(codepoint));
+ result = Operations.union(altAutomaton);
Review Comment:
Yeah, that PR does not fix any hard problem (optimizing `union` and
`concatenate` to be better in general), but it can fix your problem.
We could test, but I'd guess that this PR should be very fast, and we
should consider deprecating the previous ASCII option and just calling
this one `CASE_INSENSITIVE`. We can make the argument that it is just as
fast as, or faster than, the previous ASCII option, and we end up with a
cleaner API and easier maintenance.
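
To illustrate the kind of shortcut I have in mind, here is a rough sketch, not
the code from either PR: collect every case alternative for a code point into
one sorted, de-duplicated array first, so that a single automaton could be
built from it in one pass instead of unioning one automaton per code point.
It uses only the JDK. The names `CaseVariants`, `caseVariants` and
`extraAlternatives` are hypothetical; `extraAlternatives` stands in for
whatever the `unstableUnicodeCharacters` lookup returns.

```java
import java.util.TreeSet;

final class CaseVariants {
  private CaseVariants() {}

  /**
   * Returns the sorted, de-duplicated code points that should match
   * {@code codepoint} case-insensitively.
   *
   * @param codepoint the code point taken from the pattern
   * @param extraAlternatives additional alternatives (hypothetical stand-in
   *     for an "unstable casing" table entry); may be null
   */
  static int[] caseVariants(int codepoint, int[] extraAlternatives) {
    TreeSet<Integer> variants = new TreeSet<>();
    variants.add(codepoint);
    // Simple one-to-one case mappings provided by the JDK.
    variants.add(Character.toLowerCase(codepoint));
    variants.add(Character.toUpperCase(codepoint));
    variants.add(Character.toTitleCase(codepoint));
    if (extraAlternatives != null) {
      for (int alt : extraAlternatives) {
        variants.add(alt);
      }
    }
    // TreeSet iterates in ascending order, so the array is already sorted.
    return variants.stream().mapToInt(Integer::intValue).toArray();
  }
}
```

For example, `CaseVariants.caseVariants('k', new int[] {0x212A})` yields `k`,
`K` and the Kelvin sign as one sorted array. Whether building from such an
array is actually faster than the ASCII path is something we would still need
to measure.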