On Wed, 25 Jun 2025 18:51:52 GMT, Xueming Shen <sher...@openjdk.org> wrote:
> The root cause is an off-by-one bug introduced in an old change we made
> years ago for Pattern.CANON_EQ.
> See https://cr.openjdk.org/~sherman/regexCE/Note.txt for background info.
>
> As described in the writeup above, the basic logic of the change is:
>
> **generate the permutations, create the alternation and then put it
> appropriately into the character class (logically), we now use a special
> "Node", the NFCCharProperty to do the matching work. The NFCCharProperty
> tries to match a grapheme cluster at a time (nfc greedily, then backtrack)
> against the character class.**
>
> It appears we have an off-by-one bug in the backtrack boundary condition
> check, when it backtracks to the position 'after' the base (main)
> character (in the case where the resulting 'nfc' string is not a
> **single character** string, or does not match). In such cases, we still
> need to compare the base character against the _predicate_ to find the
> potential match.
>
> For example, in the reported scenario, the target string contains the pair
> **u+2764** (emoji) + **u+fe0f** (variation selector/emoji_component). The
> boundary edge j = Grapheme.nextBoundary() starts at **2** (after u+fe0f),
> then it backtracks to 1. The current boundary check implementation exits
> here because 0 + 1 < 1 fails, which is incorrect: the base character at
> position 0 is never tested against the predicate.
>
> This emoji pair should match correctly, as shown below
>
> jshell> var p = Pattern.compile("\\p{IsEmoji}\\p{IsEmoji_Component}",
>                                 Pattern.CANON_EQ);
> p ==> \p{IsEmoji}\p{IsEmoji_Component}
>
> jshell> p.matcher("\u2764\ufe0f").matches();
> $53 ==> true
>
> or
>
> jshell> var p = Pattern.compile("\\p{IsEmoji}", Pattern.CANON_EQ);
> p ==> \p{IsEmoji}
>
> jshell> p.matcher("\u2764\ufe0f").find();
> $55 ==> true
>
> This bug is not limited to the emoji + variation selector pairs (which
> don't 'nfc' into a single character, even though they are treated as a
> single grapheme cluster).
> It also impacts cases involving dangling or unmatched combining
> character(s). For example, the following should work/match/find, even in
> Pattern.CANON_EQ mode.
>
> jshell> p = Pattern.compile("\\p{IsGreek}\\p{IsAlphabetic}",
>                             Pattern.CANON_EQ);
> p ==> \p{IsGreek}\p{IsAlphabetic}
>
> jshell> p.matcher("\u1f80\u0345").matches();
> $57 ==> true
>
> jshell> p = Pattern.compile("[\\p{IsAlphabetic}]*", Pattern.CANON_EQ);
> p ==> [\p{IsAlphabetic}]*
>
> jshell> p.matcher("\u1f80\u0345").matches();
> $59 ==> true
>
> **note:** the grapheme boundary is not necessarily the same as the
> resulting nfc boundary.

Looks good, thanks

-------------

Marked as reviewed by rriggs (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/25986#pullrequestreview-2971400452
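
For readers following the quoted note about NFC versus grapheme boundaries, here is a minimal sketch of why the two sample strings force the backtracking path. It uses only java.text.Normalizer and is not the NFCCharProperty code itself. The class name NfcBoundaryDemo and the helper nfcCodePoints are made up for illustration. A pair such as 'a' + combining acute composes to one code point under NFC, while the emoji pair and the Greek letter with an extra combining mark both stay at two.

```java
import java.text.Normalizer;

public class NfcBoundaryDemo {
    // Hypothetical helper: number of code points left after NFC normalization.
    static int nfcCodePoints(String s) {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        String[] samples = {
            "a\u0301",       // 'a' + combining acute: composes to a single code point
            "\u2764\ufe0f",  // heart + variation selector: no composition exists
            "\u1f80\u0345"   // Greek letter + extra combining mark: stays two code points
        };
        for (String s : samples) {
            // Prints the code point count before and after NFC: 2 -> 1, 2 -> 2, 2 -> 2.
            System.out.println(s.codePointCount(0, s.length())
                    + " -> " + nfcCodePoints(s));
        }
    }
}
```

For the last two inputs the 'nfc' string never becomes a single character, so a matcher that backtracks from the grapheme boundary still has to test the base character on its own, which is the case the quoted description says was being skipped.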