[PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]

via GitHub Mon, 03 Feb 2025 11:05:51 -0800


john-wagster opened a new pull request, #14192:
URL: https://github.com/apache/lucene/pull/14192


   About four years ago ASCII-only case insensitive matching 
(https://github.com/apache/lucene-solr/pull/1541) was added to Lucene.  In the 
past couple of years, a couple of requests have been made for case insensitive 
matching across the rest of Unicode in Elasticsearch, which uses the `RegExp` 
regex `Automaton` in Lucene.  Previous discussions around the ASCII-only work 
suggested that this task may be controversial.  So I've spent a bit of time 
exploring options and have submitted this PR as what I believe is the best 
direction to take for that support, but I welcome feedback on this approach.  
   
   tl;dr the approach I've taken is to mirror `java.util.regex.Pattern`, in 
the belief that downstream users of products like Elasticsearch would expect 
and welcome Unicode's case insensitivity edge cases being handled the same way 
by both Java's `Pattern` class and Lucene's `RegExp` class.
   
   More specifically @jpountz brought up concerns around the handling of 
characters such as sigma and its variants (Σ, σ, ς) 
(https://github.com/apache/lucene-solr/pull/1541#discussion_r441002695).  I 
spent some time investigating all of the characters in Unicode and tried to 
explain edge cases within the `RegExp` class by enumerating the classes of 
characters and their behaviors so we can easily discuss or pivot as desired.  I 
then opted to handle these special classes of characters the same way 
`java.util.regex.Pattern` handles them.  Often (though I haven't tested all 
code points) Perl regex seems to treat case insensitivity the same as the 
`Pattern` class.  For instance, the `Pattern` class, when using both the 
`Pattern.CASE_INSENSITIVE` and `Pattern.UNICODE_CASE` matching flags, will 
treat the three sigma characters (Σ, σ, ς) as the same for the purposes of 
matching.  So `σ` and `ς` are a positive match in a case insensitive regex even 
though they do not match each other in a case sensitive context (they are both 
lowercase). 
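
   As a rough illustration of the `Pattern` behavior described above (a 
minimal sketch, not code from this PR; the class and helper names are made up 
for the example), the following compares the sigma variants with and without 
the Unicode case flag.  Unicode escapes are used only to keep the source 
ASCII-safe.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex.Pattern is real API here.
class SigmaMatchDemo {
    static final String CAPITAL_SIGMA = "\u03A3"; // GREEK CAPITAL LETTER SIGMA
    static final String SMALL_SIGMA = "\u03C3";   // GREEK SMALL LETTER SIGMA
    static final String FINAL_SIGMA = "\u03C2";   // GREEK SMALL LETTER FINAL SIGMA

    // Compiles the regex with the given flags and tests a full match.
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        int unicodeInsensitive = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

        // Case sensitive: the two lowercase sigmas are different characters.
        System.out.println(matches(SMALL_SIGMA, FINAL_SIGMA, 0));

        // Case insensitive with UNICODE_CASE: all three sigmas match each other.
        System.out.println(matches(SMALL_SIGMA, FINAL_SIGMA, unicodeInsensitive));
        System.out.println(matches(FINAL_SIGMA, CAPITAL_SIGMA, unicodeInsensitive));

        // CASE_INSENSITIVE alone only folds US-ASCII characters.
        System.out.println(matches(SMALL_SIGMA, CAPITAL_SIGMA, Pattern.CASE_INSENSITIVE));
    }
}
```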
   
   For this PR I've aimed for the best possible matching performance, so 
special cases are handled with a lookup table while compiling the regex to an 
`Automaton`.  Matching then only has to consider the necessary alternative 
characters to ensure a match, without any additional consideration for case 
sensitivity.  
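
   To make the idea concrete, here is a minimal sketch of expanding one code 
point into its set of alternatives at compile time.  It is an assumption about 
the general shape, not the PR's actual code: the `SPECIAL` table, the class 
name, and `alternatives` are hypothetical, and the table holds only the sigma 
entries for illustration.

```java
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch; not the RegExp implementation from the PR.
class CaseAlternativesSketch {
    // Illustrative lookup table: extra alternatives that simple upper/lower
    // casing of the code point alone would miss. Only sigma is listed here.
    static final Map<Integer, int[]> SPECIAL = Map.of(
        0x03A3, new int[] {0x03C3, 0x03C2},  // capital sigma
        0x03C3, new int[] {0x03A3, 0x03C2},  // small sigma
        0x03C2, new int[] {0x03A3, 0x03C3}); // final sigma

    // Returns every code point that should be accepted in place of codePoint
    // when case insensitive matching is enabled.
    static TreeSet<Integer> alternatives(int codePoint) {
        TreeSet<Integer> result = new TreeSet<>();
        result.add(codePoint);
        result.add(Character.toUpperCase(codePoint));
        result.add(Character.toLowerCase(codePoint));
        for (int extra : SPECIAL.getOrDefault(codePoint, new int[0])) {
            result.add(extra);
        }
        return result;
    }

    public static void main(String[] args) {
        for (int cp : new int[] {'a', '1', 0x03C2}) {
            StringBuilder line = new StringBuilder(String.format("U+%04X ->", cp));
            for (int alt : alternatives(cp)) {
                line.append(String.format(" U+%04X", alt));
            }
            System.out.println(line);
        }
    }
}
```

   Because the expansion happens once per character while the automaton is 
built, the matcher itself only walks transitions for the listed alternatives.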
   
   While most characters outside the ASCII range can be handled easily by 
uppercasing or lowercasing them appropriately, during my review of the Unicode 
spec I encountered three distinct classes of characters in Unicode that can be 
problematic.  
   * Class 1 is the set of characters such as sigma that match other characters 
outside of the immediate `toUpperCase` and `toLowerCase` forms.  However, in 
spidering the Unicode table with some utility classes I found that no more than 
four characters were ever included in a set of matched alternatives.  
   * Class 2 is the set of characters whose `toUpperCase` and `toLowerCase` 
forms both differ from the character itself and from each other.  An example 
of this is `ǅ`, whose upper case form is `Ǆ` and whose lower case form is `ǆ`.  
   * Class 3 is the set of characters that have some cased form that 
transitions the Basic Multilingual Plane (BMP).  These sets of characters are 
typically not matched by `java.util.regex.Pattern` (likely for performance 
reasons, as they transition from say 2 byte representations to 4 byte 
representations in UTF-8) and so are explicitly excluded from matching here as 
well.  As it turns out this is relatively easy, because the code point 
`toUpperCase` and `toLowerCase` methods of Java's `Character` class, which we 
were already using in `RegExp`, do not case these characters.  It's worth 
noting that supporting this is completely possible, and Java's `String` class 
`toUpperCase` and `toLowerCase` methods do correctly transition characters 
across the BMP.  
   
   Lastly it's worth noting that these classes can and do overlap.  If a 
character is in both class 3 and class 2, for instance, then again we defer to 
the behavior seen in `java.util.regex.Pattern`: we ignore the casing across the 
BMP but do case appropriately otherwise, such as with `ῼ`, which will match its 
lowercase form `ῳ` in the BMP but not its uppercase form outside the BMP `ΩΙ`.  
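
   The three classes above can be probed with plain JDK calls.  The sketch 
below is illustrative only and is not the utility code used for this PR; 
`CaseClassProbe` and its method names are hypothetical.  `caseOrbit` is one 
simple way to "spider" the simple case mappings for class 1, 
`hasDistinctUpperAndLower` checks the class 2 condition, and `main` contrasts 
the code point methods on `Character` with `String.toUpperCase` for the `ῼ` 
(U+1FFC) example.  Results depend on the Unicode version shipped with the JDK.

```java
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical probe class; only Character and String methods are real API.
class CaseClassProbe {
    // Class 1: collect every code point connected to start through the simple
    // toUpperCase/toLowerCase mappings, followed in both directions.
    static TreeSet<Integer> caseOrbit(int start) {
        TreeSet<Integer> orbit = new TreeSet<>();
        orbit.add(start);
        boolean grew = true;
        while (grew) { // repeat full scans until no new code point is added
            grew = false;
            for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
                int upper = Character.toUpperCase(cp);
                int lower = Character.toLowerCase(cp);
                if (orbit.contains(cp) || orbit.contains(upper) || orbit.contains(lower)) {
                    grew |= orbit.add(cp);
                    grew |= orbit.add(upper);
                    grew |= orbit.add(lower);
                }
            }
        }
        return orbit;
    }

    // Class 2: upper and lower forms differ from the character and each other.
    static boolean hasDistinctUpperAndLower(int cp) {
        int upper = Character.toUpperCase(cp);
        int lower = Character.toLowerCase(cp);
        return upper != cp && lower != cp && upper != lower;
    }

    static String hex(int cp) {
        return String.format("U+%04X", cp);
    }

    public static void main(String[] args) {
        // Class 1: the sigma family, starting from small sigma (U+03C3).
        StringBuilder orbit = new StringBuilder("orbit of U+03C3:");
        for (int cp : caseOrbit(0x03C3)) {
            orbit.append(' ').append(hex(cp));
        }
        System.out.println(orbit);

        // Class 2: U+01C5 (the titlecase digraph from the example above).
        System.out.println(hex(0x01C5) + " upper=" + hex(Character.toUpperCase(0x01C5))
            + " lower=" + hex(Character.toLowerCase(0x01C5))
            + " distinct=" + hasDistinctUpperAndLower(0x01C5));

        // Class 3 example: code point casing versus String casing for U+1FFC.
        System.out.println(hex(0x1FFC) + " Character upper=" + hex(Character.toUpperCase(0x1FFC))
            + " Character lower=" + hex(Character.toLowerCase(0x1FFC)));
        String upper = "\u1FFC".toUpperCase(Locale.ROOT);
        System.out.println("String upper has " + upper.length() + " UTF-16 unit(s)");
    }
}
```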


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

