On Wed, 25 Jun 2025 18:51:52 GMT, Xueming Shen <sher...@openjdk.org> wrote:

> The root cause is an off-by-one bug introduced in an old change we made years 
> ago for Pattern.CANON_EQ.
> See https://cr.openjdk.org/~sherman/regexCE/Note.txt for background info.
> 
> As described in the writeup above, the basic logic of the change is to:
> 
> **generate the permutations, create the alternation and then put it 
> appropriately into the character class (logically), we now use a special 
> "Node", the NFCCharProperty to do the matching work. The NFCCharProperty 
> tries to match a grapheme cluster at a time (nfc greedily, then backtrack) 
> against the character class.**
> 
> It appears we have an off-by-one bug in the backtrack boundary condition 
> check, when it backtracks to the position 'after' the base (main) character 
> (in the case where the resulting 'nfc' string is not a **single character** 
> string, or does not match). In such cases, we still need to match/compare 
> the base character against the _predicate_ to find the potential match. 
> 
> For example in the reported scenario, the target string contains the pair of 
> **u+2764** (emoji) + **u+fe0f** (variation selector/emoji_component). The 
> boundary edge j = Grapheme.nextBoundary() starts at **2** (after u+fe0f), 
> then it backtracks to 1. The current boundary check implementation 
> exits here because 0 + 1 < 1 is false, so the base character is never 
> compared against the predicate on its own. 
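
A minimal sketch of the boundary condition described above, not the JDK source. It assumes the backtracking loop has the shape `while (i + n < j)`, where `i` is the cluster start, `n` is the char count of the base code point and `j` is the candidate end position. The class and method names are hypothetical, and the inclusive comparison is only one way to model a fix.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the backtracking loop; names are illustrative only.
class BacktrackBoundarySketch {

    // Returns the candidate end positions j that the loop would try for a
    // cluster starting at i whose grapheme boundary is clusterEnd.
    // inclusive == false models the "i + n < j" check quoted in the description;
    // inclusive == true is an assumed variant that also tries j == i + n.
    static List<Integer> candidates(String s, int i, int clusterEnd, boolean inclusive) {
        int n = Character.charCount(s.codePointAt(i)); // width of the base character
        List<Integer> tried = new ArrayList<>();
        int j = clusterEnd;
        while (inclusive ? i + n <= j : i + n < j) {
            tried.add(j);
            // Step back over the last code point before j.
            j -= Character.charCount(s.codePointBefore(j));
        }
        return tried;
    }

    public static void main(String[] args) {
        String s = "\u2764\ufe0f"; // emoji + variation selector, boundary at 2
        System.out.println(candidates(s, 0, 2, false)); // [2]
        System.out.println(candidates(s, 0, 2, true));  // [2, 1]
    }
}
```

With the exclusive check, the position right after the base character (1) is never tried, which matches the early exit described above.
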
> 
> This emoji pair should match correctly, as shown below:
> 
> 
> jshell> var p = Pattern.compile("\\p{IsEmoji}\\p{IsEmoji_Component}", 
> Pattern.CANON_EQ);
> p ==> \p{IsEmoji}\p{IsEmoji_Component}
> 
> jshell> p.matcher("\u2764\ufe0f").matches();
> $53 ==> true
> 
> 
> or
> 
> 
> jshell> var p = Pattern.compile("\\p{IsEmoji}", Pattern.CANON_EQ);
> p ==> \p{IsEmoji}
> 
> jshell> p.matcher("\u2764\ufe0f").find();
> $55 ==> true
> 
> 
> This bug is not limited to the emoji + variation selector pairs (which don't 
> 'nfc' into a single character, even though they are treated as a single 
> grapheme cluster). It also impacts cases involving dangling or unmatched 
> combining character(s). For example, the following should work/match/find, 
> even in Pattern.CANON_EQ mode.
> 
> 
> 
> jshell> p = Pattern.compile("\\p{IsGreek}\\p{IsAlphabetic}", 
> Pattern.CANON_EQ);
> p ==> \p{IsGreek}\p{IsAlphabetic}
> 
> jshell> p.matcher("\u1f80\u0345").matches();
> $57 ==> true
> 
> jshell> p = Pattern.compile("[\\p{IsAlphabetic}]*", Pattern.CANON_EQ);
> p ==> [\p{IsAlphabetic}]*
> 
> jshell> p.matcher("\u1f80\u0345").matches();
> $59 ==> true
> 
> 
> **note:** the grapheme boundary is not necessarily the same as the resulting 
> nfc boundary.
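
The note above can be checked with the public `java.text.Normalizer` API. This is a small illustration, assuming only standard NFC behaviour; the class and method names are hypothetical. It counts the code points left after NFC normalization: a base letter plus a combining acute accent composes into one code point, while the emoji + variation selector pair from the report stays at two.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of the JDK change.
class NfcBoundaryDemo {

    // Number of code points in the NFC form of s.
    static int nfcCodePoints(String s) {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // 'e' + U+0301 (combining acute) composes to U+00E9.
        System.out.println(nfcCodePoints("e\u0301"));      // 1
        // U+2764 + U+FE0F has no canonical composition.
        System.out.println(nfcCodePoints("\u2764\ufe0f")); // 2
    }
}
```
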

This pull request has now been integrated.

Changeset: 61a590e9
Author:    Xueming Shen <sher...@openjdk.org>
URL:       https://git.openjdk.org/jdk/commit/61a590e9bea64ddfd465a5e6f224bc2979d841e9
Stats:     19 lines in 2 files changed: 15 ins; 3 del; 1 mod

8354490: Pattern.CANON_EQ causes a pattern to not match a string with a UNICODE 
variation

Reviewed-by: rriggs, naoto

-------------

PR: https://git.openjdk.org/jdk/pull/25986
