john-wagster opened a new pull request, #14192: URL: https://github.com/apache/lucene/pull/14192
About four years ago, ASCII-only case-insensitive matching (https://github.com/apache/lucene-solr/pull/1541) was added to Lucene. In the past couple of years, a couple of requests have been made for case-insensitive matching beyond the ASCII range in Elasticsearch, which uses the `RegExp` regex `Automaton` in Lucene. Previous discussions around the ASCII-only work suggested that this task may be controversial. So I've spent a bit of time exploring options and have submitted this PR as what I believe is the best direction for that support, but I welcome feedback on the approach.

tl;dr: the approach I've taken is to mirror `java.util.regex.Pattern`, in the belief that downstream users of products like Elasticsearch would expect and welcome Unicode's case-insensitivity edge cases being handled the same way by both Java's `Pattern` class and Lucene's `RegExp` class.

More specifically, @jpountz brought up concerns around the handling of characters such as sigma and its variants (Σ, σ, ς) (https://github.com/apache/lucene-solr/pull/1541#discussion_r441002695). I spent some time investigating all of the characters in Unicode and tried to explain the edge cases within the `RegExp` class by enumerating the classes of characters and their behaviors, so we can easily discuss or pivot as desired. I then opted to handle these special classes of characters the same way `java.util.regex.Pattern` handles them. Often (though I haven't tested all code points) Perl regex seems to treat case insensitivity the same as the `Pattern` class.

For instance, the `Pattern` class, when using both the `Pattern.CASE_INSENSITIVE` and `Pattern.UNICODE_CASE` matching flags, treats the three sigma characters (Σ, σ, ς) as the same for the purposes of matching. So `σ` and `ς` are a positive match in a case-insensitive regex even though they do not match each other in a case-sensitive context (they are both lowercase).
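The sigma behavior described above can be checked directly against `java.util.regex.Pattern`. The following is a minimal sketch, not code from this PR; the class name `SigmaCaseDemo` and its helper methods are hypothetical names chosen for illustration. The characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

// Hypothetical demo class (not part of Lucene or of this PR).
final class SigmaCaseDemo {
    static final String CAPITAL_SIGMA = "\u03A3"; // GREEK CAPITAL LETTER SIGMA
    static final String SMALL_SIGMA = "\u03C3";   // GREEK SMALL LETTER SIGMA
    static final String FINAL_SIGMA = "\u03C2";   // GREEK SMALL LETTER FINAL SIGMA

    /** Full-input match with both CASE_INSENSITIVE and UNICODE_CASE set. */
    static boolean matchesUnicodeInsensitive(String regex, String input) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    /** Full-input match with CASE_INSENSITIVE only, which covers ASCII letters only. */
    static boolean matchesAsciiInsensitive(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).matches();
    }

    /** Full-input match with no flags at all. */
    static boolean matchesCaseSensitive(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // The two lowercase sigmas match each other only in Unicode case-insensitive mode.
        System.out.println(matchesUnicodeInsensitive(SMALL_SIGMA, FINAL_SIGMA));
        System.out.println(matchesCaseSensitive(SMALL_SIGMA, FINAL_SIGMA));
        // Without UNICODE_CASE, even the capital and small sigma do not match.
        System.out.println(matchesAsciiInsensitive(SMALL_SIGMA, CAPITAL_SIGMA));
    }
}
```

The three helpers differ only in the flags passed to `Pattern.compile`, which makes it easy to compare the ASCII-only behavior with the Unicode-aware behavior this PR aims to mirror.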
For this PR I've opted to favor the best possible matching performance, so special cases are handled with a lookup table while compiling the regex to an `Automaton`. Matching then only has to consider the necessary alternative characters to ensure a match, without any additional consideration for case sensitivity. While most characters outside the ASCII range can be handled easily by uppercasing or lowercasing them appropriately, during my review of the Unicode spec I encountered three distinct classes of characters that can be problematic.

* Class 1 is the set of characters, such as sigma, that match other characters outside of their immediate `toUpperCase` and `toLowerCase` forms. However, in spidering the Unicode table with some utility classes, I found that no more than four characters were ever included in a set of matched alternatives.
* Class 2 is the set of characters that have both a different and distinct `toUpperCase` and `toLowerCase` form. An example of this is `Dž`, whose upper case form is `DŽ` and whose lower case form is `dž`.
* Class 3 is the set of characters that have some cased form that transitions the Basic Multilingual Plane (BMP). These sets of characters are typically not matched by `java.util.regex.Pattern` (likely for performance reasons, as they transition from, say, 2-byte representations to 4-byte representations in UTF-8) and so are explicitly excluded from matching here as well. That turns out to be relatively easy, because the code point `toUpperCase` and `toLowerCase` methods of Java's `Character` class, which we were already using in `RegExp`, do not case these characters. It's worth noting that supporting this is completely possible, and the `toUpperCase` and `toLowerCase` methods of Java's `String` class do correctly transition characters across the BMP.
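To make classes 1 and 2 concrete, here is a rough sketch of how one might spider the code point range using only `java.lang.Character`. It is not the utility code used for this PR; `CaseGroupSketch`, `fold`, `largeGroups`, and `hasDistinctUpperAndLower` are hypothetical names, and grouping by "uppercase, then lowercase" is an assumption made for illustration rather than the PR's actual lookup-table construction.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical exploration helper (not part of Lucene or of this PR).
final class CaseGroupSketch {
    /** Assumed grouping key: uppercase the code point, then lowercase the result. */
    static int fold(int codePoint) {
        return Character.toLowerCase(Character.toUpperCase(codePoint));
    }

    /** Groups all defined code points by fold() and keeps groups with more than two members. */
    static Map<Integer, Set<Integer>> largeGroups() {
        Map<Integer, Set<Integer>> groups = new TreeMap<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (!Character.isDefined(cp)) {
                continue; // skip unassigned code points
            }
            groups.computeIfAbsent(fold(cp), key -> new TreeSet<>()).add(cp);
        }
        // A plain upper/lower pair has two members; anything larger is a candidate special case.
        groups.values().removeIf(members -> members.size() <= 2);
        return groups;
    }

    /** Class 2 style check: upper and lower forms both differ from the code point and from each other. */
    static boolean hasDistinctUpperAndLower(int codePoint) {
        int upper = Character.toUpperCase(codePoint);
        int lower = Character.toLowerCase(codePoint);
        return upper != codePoint && lower != codePoint && upper != lower;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> groups = largeGroups();
        int largest = 0;
        for (Set<Integer> members : groups.values()) {
            largest = Math.max(largest, members.size());
        }
        System.out.println("groups with more than two members: " + groups.size());
        System.out.println("largest group size: " + largest);
        // U+01C5 is the titlecase digraph from the class 2 example above.
        System.out.println("U+01C5 is class 2: " + hasDistinctUpperAndLower(0x01C5));
    }
}
```

Under this grouping the three sigma forms land in one group keyed by U+03C3, and the three digraph forms land in one group keyed by U+01C6. The exact counts printed by `main` depend on the Unicode version shipped with the JDK in use.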
Lastly, it's worth noting that these classes can and do overlap. If a character is in both class 3 and class 2, for instance, then we again defer to the behavior seen in `java.util.regex.Pattern` and ignore the casing across the BMP but do case appropriately otherwise. An example is `ῼ`, which will match its lowercase form `ῳ` in the BMP but not its uppercase form outside the BMP, `ΩΙ`.
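The `ῼ` example can be reproduced with the JDK alone. The sketch below is illustrative only; `OverlapDemo` and its helpers are hypothetical names, and it exercises `java.util.regex.Pattern` rather than Lucene's `RegExp`. It compares the code point methods on `Character`, the `String` method, and `Pattern` matching for U+1FFC.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical demo class (not part of Lucene or of this PR).
final class OverlapDemo {
    static final int CAPITAL_OMEGA_PROSGEGRAMMENI = 0x1FFC; // GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
    static final int SMALL_OMEGA_YPOGEGRAMMENI = 0x1FF3;    // GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI

    /** Converts a single code point to a String. */
    static String str(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    /** Uppercases via String, which may expand one character into several. */
    static String stringUpper(int codePoint) {
        return str(codePoint).toUpperCase(Locale.ROOT);
    }

    /** Full-input match with both CASE_INSENSITIVE and UNICODE_CASE set. */
    static boolean matches(String regex, String input) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        int cp = CAPITAL_OMEGA_PROSGEGRAMMENI;
        // Character-level casing: the lowercase form changes, the uppercase form does not.
        System.out.printf("Character.toLowerCase: U+%04X%n", Character.toLowerCase(cp));
        System.out.printf("Character.toUpperCase: U+%04X%n", Character.toUpperCase(cp));
        // String-level casing produces a multi-character uppercase form.
        System.out.println("String.toUpperCase length: " + stringUpper(cp).length());
        // Pattern matches the lowercase form but not the String-level uppercase form.
        System.out.println(matches(str(cp), str(SMALL_OMEGA_YPOGEGRAMMENI)));
        System.out.println(matches(str(cp), stringUpper(cp)));
    }
}
```

The difference between the `Character` and `String` results is what makes it straightforward to stay consistent with `Pattern`: code that only ever calls the code point methods never sees the expanded uppercase form.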