The u*_grapheme_next and u*_grapheme_prev functions were not updated when the Unicode algorithm for grapheme cluster breaks became more complicated due to Indic characters, Emojis and regional indicators.
This series of patches fixes that.

2025-05-18  Bruno Haible  <br...@clisp.org>

	unigbrk/u*-grapheme-prev: Support Indic, Emojis, regional indicators.
	Reported by Kang-Che Sung <explore...@gmail.com> in
	<https://lists.gnu.org/archive/html/bug-libunistring/2025-03/msg00000.html>.
	* lib/unigbrk/u-grapheme-prev.h: New file, based on
	lib/unigbrk/u-grapheme-breaks.h.
	* lib/unigbrk/u8-grapheme-prev.c: Include unictype.h and
	u-grapheme-prev.h.
	(u8_grapheme_prev): Remove function.
	* lib/unigbrk/u16-grapheme-prev.c: Include unictype.h and
	u-grapheme-prev.h.
	(u16_grapheme_prev): Remove function.
	* lib/unigbrk/u32-grapheme-prev.c: Include unictype.h and
	u-grapheme-prev.h.
	(u32_grapheme_prev): Remove function.
	* modules/unigbrk/u8-grapheme-prev (Files): Add
	lib/unigbrk/u-grapheme-prev.h.
	(Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base,
	unigbrk/uc-gbrk-prop, unictype/incb-of,
	unictype/property-extended-pictographic, bool.
	(configure.ac): Bump required libunistring version.
	* modules/unigbrk/u16-grapheme-prev (Files): Add
	lib/unigbrk/u-grapheme-prev.h.
	(Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base,
	unigbrk/uc-gbrk-prop, unictype/incb-of,
	unictype/property-extended-pictographic, bool.
	(configure.ac): Bump required libunistring version.
	* modules/unigbrk/u32-grapheme-prev (Files): Add
	lib/unigbrk/u-grapheme-prev.h.
	(Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base,
	unigbrk/uc-gbrk-prop, unictype/incb-of,
	unictype/property-extended-pictographic, bool.
	(configure.ac): Bump required libunistring version.
	* tests/unigbrk/test-u8-grapheme-prev.c (main): Add more test cases,
	from tests/unigbrk/test-u8-grapheme-breaks.c.
	* tests/unigbrk/test-u16-grapheme-prev.c (main): Add more test cases,
	from tests/unigbrk/test-u16-grapheme-breaks.c.
	* tests/unigbrk/test-u32-grapheme-prev.c (main): Add more test cases,
	from tests/unigbrk/test-u32-grapheme-breaks.c.

2025-05-18  Bruno Haible  <br...@clisp.org>

	unigbrk/u*-grapheme-next: Support Indic, Emojis, regional indicators.
	Reported by Kang-Che Sung <explore...@gmail.com> in
	<https://lists.gnu.org/archive/html/bug-libunistring/2025-03/msg00000.html>
	and by Lich <aut...@lch361.net> in
	<https://lists.gnu.org/archive/html/bug-libunistring/2025-05/msg00000.html>.
	* lib/unigbrk/u-grapheme-next.h: New file, based on
	lib/unigbrk/u-grapheme-breaks.h.
	* lib/unigbrk/u8-grapheme-next.c: Include unictype.h and
	u-grapheme-next.h.
	(u8_grapheme_next): Remove function.
	* lib/unigbrk/u16-grapheme-next.c: Include unictype.h and
	u-grapheme-next.h.
	(u16_grapheme_next): Remove function.
	* lib/unigbrk/u32-grapheme-next.c: Include unictype.h and
	u-grapheme-next.h.
	(u32_grapheme_next): Remove function.
	* modules/unigbrk/u8-grapheme-next (Files): Add
	lib/unigbrk/u-grapheme-next.h.
	(Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base,
	unigbrk/uc-gbrk-prop, unictype/incb-of,
	unictype/property-extended-pictographic, bool.
	(configure.ac): Bump required libunistring version.
	* modules/unigbrk/u16-grapheme-next (Files): Add
	lib/unigbrk/u-grapheme-next.h.
	(Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base,
	unigbrk/uc-gbrk-prop, unictype/incb-of,
	unictype/property-extended-pictographic, bool.
	(configure.ac): Bump required libunistring version.
	* modules/unigbrk/u32-grapheme-next (Files): Add
	lib/unigbrk/u-grapheme-next.h.
	(Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base,
	unigbrk/uc-gbrk-prop, unictype/incb-of,
	unictype/property-extended-pictographic, bool.
	(configure.ac): Bump required libunistring version.
	* tests/unigbrk/test-u8-grapheme-next.c (main): Add more test cases,
	from tests/unigbrk/test-u8-grapheme-breaks.c.
	* tests/unigbrk/test-u16-grapheme-next.c (main): Add more test cases,
	from tests/unigbrk/test-u16-grapheme-breaks.c.
	* tests/unigbrk/test-u32-grapheme-next.c (main): Add more test cases,
	from tests/unigbrk/test-u32-grapheme-breaks.c.

2025-05-18  Bruno Haible  <br...@clisp.org>

	unigbrk/u*-grapheme-breaks: Tiny optimization.
	* lib/unigbrk/u-grapheme-breaks.h (FUNC): Exploit the fact that n > 0.

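To illustrate why a purely pairwise test (as in the removed code, which only looked at prev and next via uc_is_grapheme_break) cannot get regional indicators right, here is a toy sketch on UTF-32 input. It is not the library code: the names toy_next_pairwise and toy_next_stateful are hypothetical, the pairwise rule "never break between two RIs" is an assumption made for the illustration, and only the RI rule (GB12, GB13) is modelled. Both functions return a length in code points rather than a pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Regional indicator symbols occupy U+1F1E6..U+1F1FF.  */
static bool
is_ri (uint32_t uc)
{
  return uc >= 0x1F1E6 && uc <= 0x1F1FF;
}

/* Hypothetical pairwise rule: never break between two RIs.
   Looks only at the two neighbouring characters.  */
static size_t
toy_next_pairwise (const uint32_t *s, size_t n)
{
  assert (n > 0);
  size_t i;
  for (i = 1; i < n; i++)
    if (!(is_ri (s[i - 1]) && is_ri (s[i])))
      break;
  return i;
}

/* Stateful rule: keep a count of the RIs seen so far and glue the
   next RI only if that count is odd, so that RIs are grouped in pairs.
   Every other boundary is treated as a break in this toy model.  */
static size_t
toy_next_stateful (const uint32_t *s, size_t n)
{
  assert (n > 0);
  size_t ri_count = 0;
  size_t i;
  for (i = 0; i < n; i++)
    {
      if (i > 0 && !(is_ri (s[i]) && (ri_count % 2) != 0))
        break;
      if (is_ri (s[i]))
        ri_count++;
      else
        ri_count = 0;
    }
  return i;
}
```

With four consecutive regional indicators, the pairwise variant swallows all four, while the stateful variant stops after the first pair. The opposite pairwise choice (always break between two RIs) would split every flag in half, so no rule that sees only two characters can be correct here.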
From 2d37d77f6c9803541a9f773df29a4b90f65f4a4d Mon Sep 17 00:00:00 2001
From: Bruno Haible <br...@clisp.org>
Date: Mon, 19 May 2025 01:52:59 +0200
Subject: [PATCH 1/3] unigbrk/u*-grapheme-breaks: Tiny optimization.

* lib/unigbrk/u-grapheme-breaks.h (FUNC): Exploit the fact that n > 0.
---
 ChangeLog                       | 5 +++++
 lib/unigbrk/u-grapheme-breaks.h | 7 +++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 32a3bb93a3..f9eecbc588 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2025-05-18  Bruno Haible  <br...@clisp.org>
+
+	unigbrk/u*-grapheme-breaks: Tiny optimization.
+	* lib/unigbrk/u-grapheme-breaks.h (FUNC): Exploit the fact that n > 0.
+
 2025-05-17  Bruno Haible  <br...@clisp.org>
 
 	cosh: Add more tests.
diff --git a/lib/unigbrk/u-grapheme-breaks.h b/lib/unigbrk/u-grapheme-breaks.h
index 4066c9e6bb..30d5853a45 100644
--- a/lib/unigbrk/u-grapheme-breaks.h
+++ b/lib/unigbrk/u-grapheme-breaks.h
@@ -1,6 +1,5 @@
 /* Grapheme cluster break function.
    Copyright (C) 2010-2025 Free Software Foundation, Inc.
-   Written by Ben Pfaff <b...@cs.stanford.edu>, 2010.
 
    This file is free software.
    It is dual-licensed under "the GNU LGPLv3+ or the GNU GPLv2+".
@@ -23,6 +22,8 @@
    License and of the GNU General Public License along with this
    program.  If not, see <https://www.gnu.org/licenses/>.  */
 
+/* Written by Ben Pfaff, Daiki Ueno, Bruno Haible.  */
+
 /* This file implements section 3 "Grapheme Cluster Boundaries"
    of Unicode Standard Annex #29 <https://www.unicode.org/reports/tr29/>.  */
 
@@ -61,8 +62,9 @@ FUNC (const UNIT *s, size_t n, char *p)
       /* Don't break inside multibyte characters.  */
       memset (p, 0, n);
 
-      while (s < s_end)
+      do
         {
+          /* Invariant: Here s < s_end.  */
           ucs4_t uc;
           int count = U_MBTOUC (&uc, s, s_end - s);
           int prop = uc_graphemeclusterbreak_property (uc);
@@ -157,5 +159,6 @@ FUNC (const UNIT *s, size_t n, char *p)
           s += count;
           p += count;
         }
+      while (s < s_end);
     }
 }
-- 
2.43.0

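The loop change in patch 1 can be shown in isolation. The sketch below assumes, as the ChangeLog entry suggests, that the loop is only reached when n > 0, so the first "s < s_end" test of a while loop is redundant and a do-while saves one comparison. The function toy_count_nonzero is a hypothetical stand-in, not part of libunistring.

```c
#include <assert.h>
#include <stddef.h>

/* Count the nonzero bytes in s[0..n-1].  */
static size_t
toy_count_nonzero (const unsigned char *s, size_t n)
{
  size_t count = 0;

  if (n > 0)
    {
      const unsigned char *s_end = s + n;

      /* Since n > 0, the body must run at least once; a do-while
         skips the initial loop test that a while loop would make.  */
      do
        {
          /* Invariant: Here s < s_end.  */
          assert (s < s_end);
          if (*s != 0)
            count++;
          s++;
        }
      while (s < s_end);
    }

  return count;
}
```

The transformation is only valid because of the enclosing n > 0 guard; without it, a do-while would read one element from an empty buffer.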
From f783bbc7678359628bdd36a3c53a5af79c1e75a4 Mon Sep 17 00:00:00 2001 From: Bruno Haible <br...@clisp.org> Date: Mon, 19 May 2025 01:53:23 +0200 Subject: [PATCH 2/3] unigbrk/u*-grapheme-next: Support Indic, Emojis, regional indicators. Reported by Kang-Che Sung <explore...@gmail.com> in <https://lists.gnu.org/archive/html/bug-libunistring/2025-03/msg00000.html> and by Lich <aut...@lch361.net> in <https://lists.gnu.org/archive/html/bug-libunistring/2025-05/msg00000.html>. * lib/unigbrk/u-grapheme-next.h: New file, based on lib/unigbrk/u-grapheme-breaks.h. * lib/unigbrk/u8-grapheme-next.c: Include unictype.h and u-grapheme-next.h. (u8_grapheme_next): Remove function. * lib/unigbrk/u16-grapheme-next.c: Include unictype.h and u-grapheme-next.h. (u16_grapheme_next): Remove function. * lib/unigbrk/u32-grapheme-next.c: Include unictype.h and u-grapheme-next.h. (u32_grapheme_next): Remove function. * modules/unigbrk/u8-grapheme-next (Files): Add lib/unigbrk/u-grapheme-next.h. (Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base, unigbrk/uc-gbrk-prop, unictype/incb-of, unictype/property-extended-pictographic, bool. (configure.ac): Bump required libunistring version. * modules/unigbrk/u16-grapheme-next (Files): Add lib/unigbrk/u-grapheme-next.h. (Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base, unigbrk/uc-gbrk-prop, unictype/incb-of, unictype/property-extended-pictographic, bool. (configure.ac): Bump required libunistring version. * modules/unigbrk/u32-grapheme-next (Files): Add lib/unigbrk/u-grapheme-next.h. (Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base, unigbrk/uc-gbrk-prop, unictype/incb-of, unictype/property-extended-pictographic, bool. (configure.ac): Bump required libunistring version. * tests/unigbrk/test-u8-grapheme-next.c (main): Add more test cases, from tests/unigbrk/test-u8-grapheme-breaks.c. * tests/unigbrk/test-u16-grapheme-next.c (main): Add more test cases, from tests/unigbrk/test-u16-grapheme-breaks.c. 
* tests/unigbrk/test-u32-grapheme-next.c (main): Add more test cases, from tests/unigbrk/test-u32-grapheme-breaks.c. --- ChangeLog | 43 +++++++ lib/unigbrk/u-grapheme-next.h | 159 +++++++++++++++++++++++++ lib/unigbrk/u16-grapheme-next.c | 30 ++--- lib/unigbrk/u32-grapheme-next.c | 30 ++--- lib/unigbrk/u8-grapheme-next.c | 30 ++--- modules/unigbrk/u16-grapheme-next | 9 +- modules/unigbrk/u32-grapheme-next | 9 +- modules/unigbrk/u8-grapheme-next | 9 +- tests/unigbrk/test-u16-grapheme-next.c | 9 ++ tests/unigbrk/test-u32-grapheme-next.c | 9 ++ tests/unigbrk/test-u8-grapheme-next.c | 11 ++ 11 files changed, 273 insertions(+), 75 deletions(-) create mode 100644 lib/unigbrk/u-grapheme-next.h diff --git a/ChangeLog b/ChangeLog index f9eecbc588..b53bca0cb3 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,46 @@ +2025-05-18 Bruno Haible <br...@clisp.org> + + unigbrk/u*-grapheme-next: Support Indic, Emojis, regional indicators. + Reported by Kang-Che Sung <explore...@gmail.com> in + <https://lists.gnu.org/archive/html/bug-libunistring/2025-03/msg00000.html> + and by Lich <aut...@lch361.net> in + <https://lists.gnu.org/archive/html/bug-libunistring/2025-05/msg00000.html>. + * lib/unigbrk/u-grapheme-next.h: New file, based on + lib/unigbrk/u-grapheme-breaks.h. + * lib/unigbrk/u8-grapheme-next.c: Include unictype.h and + u-grapheme-next.h. + (u8_grapheme_next): Remove function. + * lib/unigbrk/u16-grapheme-next.c: Include unictype.h and + u-grapheme-next.h. + (u16_grapheme_next): Remove function. + * lib/unigbrk/u32-grapheme-next.c: Include unictype.h and + u-grapheme-next.h. + (u32_grapheme_next): Remove function. + * modules/unigbrk/u8-grapheme-next (Files): Add + lib/unigbrk/u-grapheme-next.h. + (Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base, + unigbrk/uc-gbrk-prop, unictype/incb-of, + unictype/property-extended-pictographic, bool. + (configure.ac): Bump required libunistring version. 
+ * modules/unigbrk/u16-grapheme-next (Files): Add + lib/unigbrk/u-grapheme-next.h. + (Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base, + unigbrk/uc-gbrk-prop, unictype/incb-of, + unictype/property-extended-pictographic, bool. + (configure.ac): Bump required libunistring version. + * modules/unigbrk/u32-grapheme-next (Files): Add + lib/unigbrk/u-grapheme-next.h. + (Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base, + unigbrk/uc-gbrk-prop, unictype/incb-of, + unictype/property-extended-pictographic, bool. + (configure.ac): Bump required libunistring version. + * tests/unigbrk/test-u8-grapheme-next.c (main): Add more test cases, + from tests/unigbrk/test-u8-grapheme-breaks.c. + * tests/unigbrk/test-u16-grapheme-next.c (main): Add more test cases, + from tests/unigbrk/test-u16-grapheme-breaks.c. + * tests/unigbrk/test-u32-grapheme-next.c (main): Add more test cases, + from tests/unigbrk/test-u32-grapheme-breaks.c. + 2025-05-18 Bruno Haible <br...@clisp.org> unigbrk/u*-grapheme-breaks: Tiny optimization. diff --git a/lib/unigbrk/u-grapheme-next.h b/lib/unigbrk/u-grapheme-next.h new file mode 100644 index 0000000000..9ca07436e9 --- /dev/null +++ b/lib/unigbrk/u-grapheme-next.h @@ -0,0 +1,159 @@ +/* Grapheme cluster break function. + Copyright (C) 2010-2025 Free Software Foundation, Inc. + + This file is free software. + It is dual-licensed under "the GNU LGPLv3+ or the GNU GPLv2+". + You can redistribute it and/or modify it under either + - the terms of the GNU Lesser General Public License as published + by the Free Software Foundation, either version 3, or (at your + option) any later version, or + - the terms of the GNU General Public License as published by the + Free Software Foundation; either version 2, or (at your option) + any later version, or + - the same dual license "the GNU LGPLv3+ or the GNU GPLv2+". 
+ + This file is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License and the GNU General Public License + for more details. + + You should have received a copy of the GNU Lesser General Public + License and of the GNU General Public License along with this + program. If not, see <https://www.gnu.org/licenses/>. */ + +/* Written by Ben Pfaff, Daiki Ueno, Bruno Haible. */ + +/* This file implements section 3 "Grapheme Cluster Boundaries" + of Unicode Standard Annex #29 <https://www.unicode.org/reports/tr29/>. */ + +const UNIT * +FUNC (const UNIT *s, const UNIT *s_end) +{ + if (s == s_end) + return NULL; + + /* Grapheme Cluster break property of the last character. + -1 at the very beginning of the string. */ + int last_char_prop = -1; + + /* True if the last character ends a sequence of Indic_Conjunct_Break + values: consonant {extend|linker}* */ + bool incb_consonant_extended = false; + /* True if the last character ends a sequence of Indic_Conjunct_Break + values: consonant {extend|linker}* linker */ + bool incb_consonant_extended_linker = false; + /* True if the last character ends a sequence of Indic_Conjunct_Break + values: consonant {extend|linker}* linker {extend|linker}* */ + bool incb_consonant_extended_linker_extended = false; + + /* True if the last character ends an emoji modifier sequence + \p{Extended_Pictographic} Extend*. */ + bool emoji_modifier_sequence = false; + /* True if the last character was immediately preceded by an + emoji modifier sequence \p{Extended_Pictographic} Extend*. */ + bool emoji_modifier_sequence_before_last_char = false; + + /* Number of consecutive regional indicator (RI) characters seen + immediately before the current point. 
*/ + size_t ri_count = 0; + + do + { + ucs4_t uc; + int count = U_MBTOUC (&uc, s, s_end - s); + int prop = uc_graphemeclusterbreak_property (uc); + int incb = uc_indic_conjunct_break (uc); + + /* Break at the start of the string (GB1). */ + if (last_char_prop < 0) + /* *p = 1 */; + else + { + /* No break between CR and LF (GB3). */ + if (last_char_prop == GBP_CR && prop == GBP_LF) + /* *p = 0 */; + /* Break before and after newlines (GB4, GB5). */ + else if ((last_char_prop == GBP_CR + || last_char_prop == GBP_LF + || last_char_prop == GBP_CONTROL) + || (prop == GBP_CR + || prop == GBP_LF + || prop == GBP_CONTROL)) + break /* *p = 1 */; + /* No break between Hangul syllable sequences (GB6, GB7, GB8). */ + else if ((last_char_prop == GBP_L + && (prop == GBP_L + || prop == GBP_V + || prop == GBP_LV + || prop == GBP_LVT)) + || ((last_char_prop == GBP_LV + || last_char_prop == GBP_V) + && (prop == GBP_V + || prop == GBP_T)) + || ((last_char_prop == GBP_LVT + || last_char_prop == GBP_T) + && prop == GBP_T)) + /* *p = 0 */; + /* No break before extending characters or ZWJ (GB9). */ + else if (prop == GBP_EXTEND || prop == GBP_ZWJ) + /* *p = 0 */; + /* No break before SpacingMarks (GB9a). */ + else if (prop == GBP_SPACINGMARK) + /* *p = 0 */; + /* No break after Prepend characters (GB9b). */ + else if (last_char_prop == GBP_PREPEND) + /* *p = 0 */; + /* No break within certain combinations of Indic_Conjunct_Break + values: Between + consonant {extend|linker}* linker {extend|linker}* + and + consonant + (GB9c). */ + else if (incb_consonant_extended_linker_extended + && incb == UC_INDIC_CONJUNCT_BREAK_CONSONANT) + /* *p = 0 */; + /* No break within emoji modifier sequences or emoji zwj sequences + (GB11). */ + else if (last_char_prop == GBP_ZWJ + && emoji_modifier_sequence_before_last_char + && uc_is_property_extended_pictographic (uc)) + /* *p = 0 */; + /* No break between RI if there is an odd number of RI + characters before (GB12, GB13). 
*/ + else if (prop == GBP_RI && (ri_count % 2) != 0) + /* *p = 0 */; + /* Break everywhere (GB999). */ + else + break /* *p = 1 */; + } + + incb_consonant_extended_linker = + incb_consonant_extended && incb == UC_INDIC_CONJUNCT_BREAK_LINKER; + incb_consonant_extended_linker_extended = + (incb_consonant_extended_linker + || (incb_consonant_extended_linker_extended + && incb >= UC_INDIC_CONJUNCT_BREAK_LINKER)); + incb_consonant_extended = + (incb == UC_INDIC_CONJUNCT_BREAK_CONSONANT + || (incb_consonant_extended + && incb >= UC_INDIC_CONJUNCT_BREAK_LINKER)); + + emoji_modifier_sequence_before_last_char = emoji_modifier_sequence; + emoji_modifier_sequence = + (emoji_modifier_sequence && prop == GBP_EXTEND) + || uc_is_property_extended_pictographic (uc); + + last_char_prop = prop; + + if (prop == GBP_RI) + ri_count++; + else + ri_count = 0; + + s += count; + } + while (s < s_end); + + return s; +} diff --git a/lib/unigbrk/u16-grapheme-next.c b/lib/unigbrk/u16-grapheme-next.c index b0e47e17c8..5e7a783d8f 100644 --- a/lib/unigbrk/u16-grapheme-next.c +++ b/lib/unigbrk/u16-grapheme-next.c @@ -1,6 +1,5 @@ /* Next grapheme cluster function. Copyright (C) 2010-2025 Free Software Foundation, Inc. - Written by Ben Pfaff <b...@cs.stanford.edu>, 2010. This file is free software. It is dual-licensed under "the GNU LGPLv3+ or the GNU GPLv2+". @@ -23,6 +22,8 @@ License and of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>. */ +/* Written by Bruno Haible <br...@clisp.org>, 2025. */ + /* Don't use the const-improved function macros in this compilation unit. */ #define _LIBUNISTRING_NO_CONST_GENERICS @@ -31,27 +32,10 @@ /* Specification. 
*/ #include "unigbrk.h" +#include "unictype.h" #include "unistr.h" -const uint16_t * -u16_grapheme_next (const uint16_t *s, const uint16_t *end) -{ - ucs4_t prev; - int mblen; - - if (s == end) - return NULL; - - for (s += u16_mbtouc (&prev, s, end - s); s != end; s += mblen) - { - ucs4_t next; - - mblen = u16_mbtouc (&next, s, end - s); - if (uc_is_grapheme_break (prev, next)) - break; - - prev = next; - } - - return s; -} +#define FUNC u16_grapheme_next +#define UNIT uint16_t +#define U_MBTOUC u16_mbtouc +#include "u-grapheme-next.h" diff --git a/lib/unigbrk/u32-grapheme-next.c b/lib/unigbrk/u32-grapheme-next.c index 28fc5052e5..1c9adfa6f3 100644 --- a/lib/unigbrk/u32-grapheme-next.c +++ b/lib/unigbrk/u32-grapheme-next.c @@ -1,6 +1,5 @@ /* Next grapheme cluster function. Copyright (C) 2010-2025 Free Software Foundation, Inc. - Written by Ben Pfaff <b...@cs.stanford.edu>, 2010. This file is free software. It is dual-licensed under "the GNU LGPLv3+ or the GNU GPLv2+". @@ -23,6 +22,8 @@ License and of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>. */ +/* Written by Bruno Haible <br...@clisp.org>, 2025. */ + /* Don't use the const-improved function macros in this compilation unit. */ #define _LIBUNISTRING_NO_CONST_GENERICS @@ -31,27 +32,10 @@ /* Specification. 
*/ #include "unigbrk.h" +#include "unictype.h" #include "unistr.h" -const uint32_t * -u32_grapheme_next (const uint32_t *s, const uint32_t *end) -{ - ucs4_t prev; - - if (s == end) - return NULL; - - u32_mbtouc (&prev, s, end - s); - for (s++; s != end; s++) - { - ucs4_t next; - - u32_mbtouc (&next, s, end - s); - if (uc_is_grapheme_break (prev, next)) - break; - - prev = next; - } - - return s; -} +#define FUNC u32_grapheme_next +#define UNIT uint32_t +#define U_MBTOUC u32_mbtouc +#include "u-grapheme-next.h" diff --git a/lib/unigbrk/u8-grapheme-next.c b/lib/unigbrk/u8-grapheme-next.c index b1d2e3dd3e..2ec094da2a 100644 --- a/lib/unigbrk/u8-grapheme-next.c +++ b/lib/unigbrk/u8-grapheme-next.c @@ -1,6 +1,5 @@ /* Next grapheme cluster function. Copyright (C) 2010-2025 Free Software Foundation, Inc. - Written by Ben Pfaff <b...@cs.stanford.edu>, 2010. This file is free software. It is dual-licensed under "the GNU LGPLv3+ or the GNU GPLv2+". @@ -23,6 +22,8 @@ License and of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>. */ +/* Written by Bruno Haible <br...@clisp.org>, 2025. */ + /* Don't use the const-improved function macros in this compilation unit. */ #define _LIBUNISTRING_NO_CONST_GENERICS @@ -31,27 +32,10 @@ /* Specification. 
*/ #include "unigbrk.h" +#include "unictype.h" #include "unistr.h" -const uint8_t * -u8_grapheme_next (const uint8_t *s, const uint8_t *end) -{ - ucs4_t prev; - int mblen; - - if (s == end) - return NULL; - - for (s += u8_mbtouc (&prev, s, end - s); s != end; s += mblen) - { - ucs4_t next; - - mblen = u8_mbtouc (&next, s, end - s); - if (uc_is_grapheme_break (prev, next)) - break; - - prev = next; - } - - return s; -} +#define FUNC u8_grapheme_next +#define UNIT uint8_t +#define U_MBTOUC u8_mbtouc +#include "u-grapheme-next.h" diff --git a/modules/unigbrk/u16-grapheme-next b/modules/unigbrk/u16-grapheme-next index 7443375331..8eb500a77d 100644 --- a/modules/unigbrk/u16-grapheme-next +++ b/modules/unigbrk/u16-grapheme-next @@ -3,15 +3,20 @@ Find start of next grapheme cluster in UTF-16 string. Files: lib/unigbrk/u16-grapheme-next.c +lib/unigbrk/u-grapheme-next.h tests/macros.h Depends-on: -unigbrk/uc-is-grapheme-break +unigbrk/base +unigbrk/uc-gbrk-prop +unictype/incb-of +unictype/property-extended-pictographic unistr/u16-mbtouc +bool configure.ac: gl_MODULE_INDICATOR([unigbrk/u16-grapheme-next]) -gl_LIBUNISTRING_MODULE([1.3], [unigbrk/u16-grapheme-next]) +gl_LIBUNISTRING_MODULE([1.4], [unigbrk/u16-grapheme-next]) Makefile.am: if LIBUNISTRING_COMPILE_UNIGBRK_U16_GRAPHEME_NEXT diff --git a/modules/unigbrk/u32-grapheme-next b/modules/unigbrk/u32-grapheme-next index 28daec526c..5045c390a9 100644 --- a/modules/unigbrk/u32-grapheme-next +++ b/modules/unigbrk/u32-grapheme-next @@ -3,15 +3,20 @@ Find start of next grapheme cluster in UTF-32 string. 
Files: lib/unigbrk/u32-grapheme-next.c +lib/unigbrk/u-grapheme-next.h tests/macros.h Depends-on: -unigbrk/uc-is-grapheme-break +unigbrk/base +unigbrk/uc-gbrk-prop +unictype/incb-of +unictype/property-extended-pictographic unistr/u32-mbtouc +bool configure.ac: gl_MODULE_INDICATOR([unigbrk/u32-grapheme-next]) -gl_LIBUNISTRING_MODULE([1.3], [unigbrk/u32-grapheme-next]) +gl_LIBUNISTRING_MODULE([1.4], [unigbrk/u32-grapheme-next]) Makefile.am: if LIBUNISTRING_COMPILE_UNIGBRK_U32_GRAPHEME_NEXT diff --git a/modules/unigbrk/u8-grapheme-next b/modules/unigbrk/u8-grapheme-next index 50fc06a5c0..c8197cd4e7 100644 --- a/modules/unigbrk/u8-grapheme-next +++ b/modules/unigbrk/u8-grapheme-next @@ -3,15 +3,20 @@ Find start of next grapheme cluster in UTF-8 string. Files: lib/unigbrk/u8-grapheme-next.c +lib/unigbrk/u-grapheme-next.h tests/macros.h Depends-on: -unigbrk/uc-is-grapheme-break +unigbrk/base +unigbrk/uc-gbrk-prop +unictype/incb-of +unictype/property-extended-pictographic unistr/u8-mbtouc +bool configure.ac: gl_MODULE_INDICATOR([unigbrk/u8-grapheme-next]) -gl_LIBUNISTRING_MODULE([1.3], [unigbrk/u8-grapheme-next]) +gl_LIBUNISTRING_MODULE([1.4], [unigbrk/u8-grapheme-next]) Makefile.am: if LIBUNISTRING_COMPILE_UNIGBRK_U8_GRAPHEME_NEXT diff --git a/tests/unigbrk/test-u16-grapheme-next.c b/tests/unigbrk/test-u16-grapheme-next.c index d2647a31a4..555770a96f 100644 --- a/tests/unigbrk/test-u16-grapheme-next.c +++ b/tests/unigbrk/test-u16-grapheme-next.c @@ -95,6 +95,15 @@ main (void) test_u16_grapheme_next (2, 'e', ACUTE, 'x', -1); test_u16_grapheme_next (2, 'e', ACUTE, 'e', ACUTE, -1); + /* CR LF handling. */ + test_u16_grapheme_next (2, '\r', '\n', 'd', -1); + + /* Emoji modifier / ZWJ sequence. */ + test_u16_grapheme_next (5, 0x2605, 0x0305, 0x0347, 0x200D, 0x2600, -1); + + /* Regional indicators. */ + test_u16_grapheme_next (4, 0xD83C, 0xDDE9, 0xD83C, 0xDDEA, 0xD83C, 0xDDEB, 0xD83C, 0xDDF7, -1); + /* Surrogate pairs. 
*/ test_u16_grapheme_next (2, 0xd83d, 0xde10, -1); /* ????: neutral face. */ test_u16_grapheme_next (3, 0xd83d, 0xde10, GRAVE, -1); diff --git a/tests/unigbrk/test-u32-grapheme-next.c b/tests/unigbrk/test-u32-grapheme-next.c index 58fb1e2eb5..db3a1590f8 100644 --- a/tests/unigbrk/test-u32-grapheme-next.c +++ b/tests/unigbrk/test-u32-grapheme-next.c @@ -95,6 +95,15 @@ main (void) test_u32_grapheme_next (2, 'e', ACUTE, 'x', -1); test_u32_grapheme_next (2, 'e', ACUTE, 'e', ACUTE, -1); + /* CR LF handling. */ + test_u32_grapheme_next (2, '\r', '\n', 'd', -1); + + /* Emoji modifier / ZWJ sequence. */ + test_u32_grapheme_next (5, 0x2605, 0x0305, 0x0347, 0x200D, 0x2600, -1); + + /* Regional indicators. */ + test_u32_grapheme_next (2, 0x1F1E9, 0x1F1EA, 0x1F1EB, 0x1F1F7, -1); + /* Outside BMP. */ #define NEUTRAL_FACE 0x1f610 /* ????: neutral face. */ test_u32_grapheme_next (1, NEUTRAL_FACE, -1); diff --git a/tests/unigbrk/test-u8-grapheme-next.c b/tests/unigbrk/test-u8-grapheme-next.c index a818504bf6..00521639a3 100644 --- a/tests/unigbrk/test-u8-grapheme-next.c +++ b/tests/unigbrk/test-u8-grapheme-next.c @@ -76,5 +76,16 @@ main (void) test_u8_grapheme_next ("e"ACUTE"x", 4, 3); test_u8_grapheme_next ("e"ACUTE "e"ACUTE, 6, 3); + /* CR LF handling. */ + test_u8_grapheme_next ("\r\nd", 3, 2); + + /* Emoji modifier / ZWJ sequence. */ + test_u8_grapheme_next ("\342\230\205\314\205\315\207\342\200\215\342\230\200", + 13, 13); + + /* Regional indicators. */ + test_u8_grapheme_next ("\360\237\207\251\360\237\207\252\360\237\207\253\360\237\207\267", + 16, 8); + return test_exit_status; } -- 2.43.0
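The two emoji flags in u-grapheme-next.h (emoji_modifier_sequence and emoji_modifier_sequence_before_last_char) may be easier to follow in a reduced model. The sketch below is a toy, not the library code: toy_classify is a hypothetical classifier that knows only the code points used in the new test case, and only GB9 and GB11 are modelled, with every other boundary treated as a break. It returns a length in code points, and 0 for empty input where the real function returns NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum toy_class { TOY_OTHER, TOY_EXTEND, TOY_ZWJ, TOY_EXT_PICT };

/* Hypothetical classifier covering only the code points of the test case
   (U+2605 U+0305 U+0347 U+200D U+2600); everything else is TOY_OTHER.  */
static enum toy_class
toy_classify (uint32_t uc)
{
  if (uc == 0x200D)
    return TOY_ZWJ;
  if (uc == 0x0305 || uc == 0x0347)
    return TOY_EXTEND;
  if (uc == 0x2600 || uc == 0x2605)
    return TOY_EXT_PICT;
  return TOY_OTHER;
}

/* Length, in code points, of the first cluster of s[0..n-1].  */
static size_t
toy_cluster_length (const uint32_t *s, size_t n)
{
  assert (s != NULL || n == 0);

  /* True if the last character ends Extended_Pictographic Extend*.  */
  bool seq = false;
  /* Value of seq one character earlier.  */
  bool seq_before_last = false;
  enum toy_class last = TOY_OTHER;
  size_t i;

  for (i = 0; i < n; i++)
    {
      enum toy_class c = toy_classify (s[i]);

      if (i > 0)
        {
          bool no_break =
            /* GB9: no break before Extend or ZWJ.  */
            (c == TOY_EXTEND || c == TOY_ZWJ)
            /* GB11: ExtPict Extend* ZWJ x ExtPict.  */
            || (last == TOY_ZWJ && seq_before_last && c == TOY_EXT_PICT);
          if (!no_break)
            break;
        }

      /* The ZWJ itself resets seq, so the state from before the ZWJ
         has to be remembered separately.  */
      seq_before_last = seq;
      seq = (seq && c == TOY_EXTEND) || c == TOY_EXT_PICT;
      last = c;
    }

  return i;
}
```

On the sequence from the new test case, the whole five-character run stays in one cluster, matching the expected value 5 in test-u32-grapheme-next.c. If the character before the ZWJ is not pictographic, the cluster ends right after the ZWJ.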
From 0a319fd506eeb3fa832e50c9a2ed5aa492401200 Mon Sep 17 00:00:00 2001 From: Bruno Haible <br...@clisp.org> Date: Mon, 19 May 2025 02:01:32 +0200 Subject: [PATCH 3/3] unigbrk/u*-grapheme-prev: Support Indic, Emojis, regional indicators. Reported by Kang-Che Sung <explore...@gmail.com> in <https://lists.gnu.org/archive/html/bug-libunistring/2025-03/msg00000.html>. * lib/unigbrk/u-grapheme-prev.h: New file, based on lib/unigbrk/u-grapheme-breaks.h. * lib/unigbrk/u8-grapheme-prev.c: Include unictype.h and u-grapheme-prev.h. (u8_grapheme_prev): Remove function. * lib/unigbrk/u16-grapheme-prev.c: Include unictype.h and u-grapheme-prev.h. (u16_grapheme_prev): Remove function. * lib/unigbrk/u32-grapheme-prev.c: Include unictype.h and u-grapheme-prev.h. (u32_grapheme_prev): Remove function. * modules/unigbrk/u8-grapheme-prev (Files): Add lib/unigbrk/u-grapheme-prev.h. (Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base, unigbrk/uc-gbrk-prop, unictype/incb-of, unictype/property-extended-pictographic, bool. (configure.ac): Bump required libunistring version. * modules/unigbrk/u16-grapheme-prev (Files): Add lib/unigbrk/u-grapheme-prev.h. (Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base, unigbrk/uc-gbrk-prop, unictype/incb-of, unictype/property-extended-pictographic, bool. (configure.ac): Bump required libunistring version. * modules/unigbrk/u32-grapheme-prev (Files): Add lib/unigbrk/u-grapheme-prev.h. (Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base, unigbrk/uc-gbrk-prop, unictype/incb-of, unictype/property-extended-pictographic, bool. (configure.ac): Bump required libunistring version. * tests/unigbrk/test-u8-grapheme-prev.c (main): Add more test cases, from tests/unigbrk/test-u8-grapheme-breaks.c. * tests/unigbrk/test-u16-grapheme-prev.c (main): Add more test cases, from tests/unigbrk/test-u16-grapheme-breaks.c. 
* tests/unigbrk/test-u32-grapheme-prev.c (main): Add more test cases, from tests/unigbrk/test-u32-grapheme-breaks.c. --- ChangeLog | 41 +++++ lib/unigbrk/u-grapheme-prev.h | 233 +++++++++++++++++++++++++ lib/unigbrk/u16-grapheme-prev.c | 38 +--- lib/unigbrk/u32-grapheme-prev.c | 35 +--- lib/unigbrk/u8-grapheme-prev.c | 38 +--- modules/unigbrk/u16-grapheme-prev | 9 +- modules/unigbrk/u32-grapheme-prev | 9 +- modules/unigbrk/u8-grapheme-prev | 9 +- tests/unigbrk/test-u16-grapheme-prev.c | 9 + tests/unigbrk/test-u32-grapheme-prev.c | 9 + tests/unigbrk/test-u8-grapheme-prev.c | 11 ++ 11 files changed, 345 insertions(+), 96 deletions(-) create mode 100644 lib/unigbrk/u-grapheme-prev.h diff --git a/ChangeLog b/ChangeLog index b53bca0cb3..09c5b008df 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,44 @@ +2025-05-18 Bruno Haible <br...@clisp.org> + + unigbrk/u*-grapheme-prev: Support Indic, Emojis, regional indicators. + Reported by Kang-Che Sung <explore...@gmail.com> in + <https://lists.gnu.org/archive/html/bug-libunistring/2025-03/msg00000.html>. + * lib/unigbrk/u-grapheme-prev.h: New file, based on + lib/unigbrk/u-grapheme-breaks.h. + * lib/unigbrk/u8-grapheme-prev.c: Include unictype.h and + u-grapheme-prev.h. + (u8_grapheme_prev): Remove function. + * lib/unigbrk/u16-grapheme-prev.c: Include unictype.h and + u-grapheme-prev.h. + (u16_grapheme_prev): Remove function. + * lib/unigbrk/u32-grapheme-prev.c: Include unictype.h and + u-grapheme-prev.h. + (u32_grapheme_prev): Remove function. + * modules/unigbrk/u8-grapheme-prev (Files): Add + lib/unigbrk/u-grapheme-prev.h. + (Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base, + unigbrk/uc-gbrk-prop, unictype/incb-of, + unictype/property-extended-pictographic, bool. + (configure.ac): Bump required libunistring version. + * modules/unigbrk/u16-grapheme-prev (Files): Add + lib/unigbrk/u-grapheme-prev.h. + (Depends-on): Remove unigbrk/uc-is-grapheme-break. 
Add unigbrk/base,
+	unigbrk/uc-gbrk-prop, unictype/incb-of,
+	unictype/property-extended-pictographic, bool.
+	(configure.ac): Bump required libunistring version.
+	* modules/unigbrk/u32-grapheme-prev (Files): Add
+	lib/unigbrk/u-grapheme-prev.h.
+	(Depends-on): Remove unigbrk/uc-is-grapheme-break. Add unigbrk/base,
+	unigbrk/uc-gbrk-prop, unictype/incb-of,
+	unictype/property-extended-pictographic, bool.
+	(configure.ac): Bump required libunistring version.
+	* tests/unigbrk/test-u8-grapheme-prev.c (main): Add more test cases,
+	from tests/unigbrk/test-u8-grapheme-breaks.c.
+	* tests/unigbrk/test-u16-grapheme-prev.c (main): Add more test cases,
+	from tests/unigbrk/test-u16-grapheme-breaks.c.
+	* tests/unigbrk/test-u32-grapheme-prev.c (main): Add more test cases,
+	from tests/unigbrk/test-u32-grapheme-breaks.c.
+
 2025-05-18  Bruno Haible  <br...@clisp.org>
 
 	unigbrk/u*-grapheme-next: Support Indic, Emojis, regional indicators.
diff --git a/lib/unigbrk/u-grapheme-prev.h b/lib/unigbrk/u-grapheme-prev.h
new file mode 100644
index 0000000000..0894d5992e
--- /dev/null
+++ b/lib/unigbrk/u-grapheme-prev.h
@@ -0,0 +1,233 @@
+/* Grapheme cluster break function.
+   Copyright (C) 2010-2025 Free Software Foundation, Inc.
+
+   This file is free software.
+   It is dual-licensed under "the GNU LGPLv3+ or the GNU GPLv2+".
+   You can redistribute it and/or modify it under either
+     - the terms of the GNU Lesser General Public License as published
+       by the Free Software Foundation, either version 3, or (at your
+       option) any later version, or
+     - the terms of the GNU General Public License as published by the
+       Free Software Foundation; either version 2, or (at your option)
+       any later version, or
+     - the same dual license "the GNU LGPLv3+ or the GNU GPLv2+".
+
+   This file is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License and the GNU General Public License
+   for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License and of the GNU General Public License along with this
+   program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Bruno Haible <br...@clisp.org>, 2025.  */
+
+/* This file implements section 3 "Grapheme Cluster Boundaries"
+   of Unicode Standard Annex #29 <https://www.unicode.org/reports/tr29/>
+   backwards.  */
+
+/* Returns true if the string [s_start, s) ends with a sequence of
+   Indic_Conjunct_Break values like:
+     consonant {extend|linker}* linker {extend|linker}*
+ */
+static bool
+ends_with_incb_consonant_extended_linker_extended (const UNIT *s,
+                                                   const UNIT *s_start)
+{
+  /* Look for
+       consonant {extend|linker}*
+     with at least one linker.  */
+  bool seen_linker = false;
+
+  while (s > s_start)
+    {
+      const UNIT *prev_s;
+      ucs4_t uc;
+
+      prev_s = U_PREV (&uc, s, s_start);
+      if (prev_s == NULL)
+        /* Ill-formed UTF-8 encoding. */
+        break;
+
+      int incb = uc_indic_conjunct_break (uc);
+      if (incb == UC_INDIC_CONJUNCT_BREAK_CONSONANT)
+        return seen_linker;
+      if (!(incb >= UC_INDIC_CONJUNCT_BREAK_LINKER))
+        break;
+      seen_linker |= (incb == UC_INDIC_CONJUNCT_BREAK_LINKER);
+
+      s = prev_s;
+    }
+
+  return false;
+}
+
+/* Returns true if the string [s_start, s) ends with a sequence of
+   characters like:
+     \p{Extended_Pictographic} Extend*
+ */
+static bool
+ends_with_emoji_modifier_sequence (const UNIT *s, const UNIT *s_start)
+{
+  while (s > s_start)
+    {
+      const UNIT *prev_s;
+      ucs4_t uc;
+
+      prev_s = U_PREV (&uc, s, s_start);
+      if (prev_s == NULL)
+        /* Ill-formed UTF-8 encoding. */
+        break;
+
+      if (uc_is_property_extended_pictographic (uc))
+        return true;
+
+      if (uc_graphemeclusterbreak_property (uc) != GBP_EXTEND)
+        break;
+
+      s = prev_s;
+    }
+
+  return false;
+}
+
+/* Returns the number of consecutive regional indicator (RI) characters
+   at the end of the string [s_start, s).  */
+static size_t
+ends_with_ri_count (const UNIT *s, const UNIT *s_start)
+{
+  size_t ri_count = 0;
+
+  while (s > s_start)
+    {
+      const UNIT *prev_s;
+      ucs4_t uc;
+
+      prev_s = U_PREV (&uc, s, s_start);
+      if (prev_s == NULL)
+        /* Ill-formed UTF-8 encoding. */
+        break;
+
+      if (uc_graphemeclusterbreak_property (uc) == GBP_RI)
+        ri_count++;
+      else
+        break;
+
+      s = prev_s;
+    }
+
+  return ri_count;
+}
+
+const UNIT *
+FUNC (const UNIT *s, const UNIT *s_start)
+{
+  if (s == s_start)
+    return NULL;
+
+  /* Traverse the string backwards, from s down to s_start.  */
+
+  /* Grapheme Cluster break property of the next character.
+     -1 at the very end of the string.  */
+  int next_char_prop = -1;
+
+  /* Indic_Conjunct_Break property of the next character.
+     -1 at the very end of the string.  */
+  int next_char_incb = -1;
+
+  /* Extended_Pictographic property of the next character.
+     false at the very end of the string.  */
+  bool next_char_epic = false;
+
+  do
+    {
+      const UNIT *prev_s;
+      ucs4_t uc;
+
+      prev_s = U_PREV (&uc, s, s_start);
+      if (prev_s == NULL)
+        {
+          /* Ill-formed UTF-8 encoding. */
+          return s_start;
+        }
+
+      int prop = uc_graphemeclusterbreak_property (uc);
+      int incb = uc_indic_conjunct_break (uc);
+      bool epic = uc_is_property_extended_pictographic (uc);
+
+      /* Break at the end of the string (GB2).  */
+      if (next_char_prop < 0)
+        /* *p = 1 */;
+      else
+        {
+          /* No break between CR and LF (GB3).  */
+          if (prop == GBP_CR && next_char_prop == GBP_LF)
+            /* *p = 0 */;
+          /* Break before and after newlines (GB4, GB5).  */
+          else if ((prop == GBP_CR
+                    || prop == GBP_LF
+                    || prop == GBP_CONTROL)
+                   || (next_char_prop == GBP_CR
+                       || next_char_prop == GBP_LF
+                       || next_char_prop == GBP_CONTROL))
+            break /* *p = 1 */;
+          /* No break between Hangul syllable sequences (GB6, GB7, GB8).  */
+          else if ((prop == GBP_L
+                    && (next_char_prop == GBP_L
+                        || next_char_prop == GBP_V
+                        || next_char_prop == GBP_LV
+                        || next_char_prop == GBP_LVT))
+                   || ((prop == GBP_LV
+                        || prop == GBP_V)
+                       && (next_char_prop == GBP_V
+                           || next_char_prop == GBP_T))
+                   || ((prop == GBP_LVT
+                        || prop == GBP_T)
+                       && next_char_prop == GBP_T))
+            /* *p = 0 */;
+          /* No break before extending characters or ZWJ (GB9).  */
+          else if (next_char_prop == GBP_EXTEND || next_char_prop == GBP_ZWJ)
+            /* *p = 0 */;
+          /* No break before SpacingMarks (GB9a).  */
+          else if (next_char_prop == GBP_SPACINGMARK)
+            /* *p = 0 */;
+          /* No break after Prepend characters (GB9b).  */
+          else if (prop == GBP_PREPEND)
+            /* *p = 0 */;
+          /* No break within certain combinations of Indic_Conjunct_Break
+             values: Between
+               consonant {extend|linker}* linker {extend|linker}*
+             and
+               consonant
+             (GB9c).  */
+          else if (next_char_incb == UC_INDIC_CONJUNCT_BREAK_CONSONANT
+                   && ends_with_incb_consonant_extended_linker_extended (s, s_start))
+            /* *p = 0 */;
+          /* No break within emoji modifier sequences or emoji zwj sequences
+             (GB11).  */
+          else if (next_char_epic
+                   && prop == GBP_ZWJ
+                   && ends_with_emoji_modifier_sequence (prev_s, s_start))
+            /* *p = 0 */;
+          /* No break between RI if there is an odd number of RI
+             characters before (GB12, GB13).  */
+          else if (next_char_prop == GBP_RI
+                   && prop == GBP_RI
+                   && (ends_with_ri_count (prev_s, s_start) % 2) == 0)
+            /* *p = 0 */;
+          /* Break everywhere (GB999).  */
+          else
+            break /* *p = 1 */;
+        }
+
+      s = prev_s;
+      next_char_prop = prop;
+      next_char_incb = incb;
+      next_char_epic = epic;
+    }
+  while (s > s_start);
+
+  return s;
+}
diff --git a/lib/unigbrk/u16-grapheme-prev.c b/lib/unigbrk/u16-grapheme-prev.c
index 02fe72f261..4c70e11843 100644
--- a/lib/unigbrk/u16-grapheme-prev.c
+++ b/lib/unigbrk/u16-grapheme-prev.c
@@ -1,6 +1,5 @@
 /* Previous grapheme cluster function.
    Copyright (C) 2010-2025 Free Software Foundation, Inc.
-   Written by Ben Pfaff <b...@cs.stanford.edu>, 2010.
 
    This file is free software.
    It is dual-licensed under "the GNU LGPLv3+ or the GNU GPLv2+".
@@ -23,6 +22,8 @@
    License and of the GNU General Public License along with this
    program.  If not, see <https://www.gnu.org/licenses/>.  */
 
+/* Written by Bruno Haible <br...@clisp.org>, 2025.  */
+
 /* Don't use the const-improved function macros in this compilation unit.  */
 #define _LIBUNISTRING_NO_CONST_GENERICS
 
@@ -31,35 +32,10 @@
 /* Specification.  */
 #include "unigbrk.h"
 
+#include "unictype.h"
 #include "unistr.h"
 
-const uint16_t *
-u16_grapheme_prev (const uint16_t *s, const uint16_t *start)
-{
-  ucs4_t next;
-
-  if (s == start)
-    return NULL;
-
-  s = u16_prev (&next, s, start);
-  while (s != start)
-    {
-      const uint16_t *prev_s;
-      ucs4_t prev;
-
-      prev_s = u16_prev (&prev, s, start);
-      if (prev_s == NULL)
-        {
-          /* Ill-formed UTF-16 encoding. */
-          return start;
-        }
-
-      if (uc_is_grapheme_break (prev, next))
-        break;
-
-      s = prev_s;
-      next = prev;
-    }
-
-  return s;
-}
+#define FUNC u16_grapheme_prev
+#define UNIT uint16_t
+#define U_PREV u16_prev
+#include "u-grapheme-prev.h"
diff --git a/lib/unigbrk/u32-grapheme-prev.c b/lib/unigbrk/u32-grapheme-prev.c
index c76fb9ab52..977a1977c6 100644
--- a/lib/unigbrk/u32-grapheme-prev.c
+++ b/lib/unigbrk/u32-grapheme-prev.c
@@ -1,6 +1,5 @@
 /* Previous grapheme cluster function.
    Copyright (C) 2010-2025 Free Software Foundation, Inc.
-   Written by Ben Pfaff <b...@cs.stanford.edu>, 2010.
 
    This file is free software.
    It is dual-licensed under "the GNU LGPLv3+ or the GNU GPLv2+".
@@ -23,6 +22,8 @@
    License and of the GNU General Public License along with this
    program.  If not, see <https://www.gnu.org/licenses/>.  */
 
+/* Written by Bruno Haible <br...@clisp.org>, 2025.  */
+
 /* Don't use the const-improved function macros in this compilation unit.  */
 #define _LIBUNISTRING_NO_CONST_GENERICS
 
@@ -31,32 +32,10 @@
 /* Specification.  */
 #include "unigbrk.h"
 
+#include "unictype.h"
 #include "unistr.h"
 
-const uint32_t *
-u32_grapheme_prev (const uint32_t *s, const uint32_t *start)
-{
-  ucs4_t next;
-
-  if (s == start)
-    return NULL;
-
-  u32_prev (&next, s, start);
-  for (s--; s != start; s--)
-    {
-      ucs4_t prev;
-
-      if (u32_prev (&prev, s, start) == NULL)
-        {
-          /* Ill-formed UTF-32 encoding. */
-          return start;
-        }
-
-      if (uc_is_grapheme_break (prev, next))
-        break;
-
-      next = prev;
-    }
-
-  return s;
-}
+#define FUNC u32_grapheme_prev
+#define UNIT uint32_t
+#define U_PREV u32_prev
+#include "u-grapheme-prev.h"
diff --git a/lib/unigbrk/u8-grapheme-prev.c b/lib/unigbrk/u8-grapheme-prev.c
index 79748cf3fb..a2d872f01b 100644
--- a/lib/unigbrk/u8-grapheme-prev.c
+++ b/lib/unigbrk/u8-grapheme-prev.c
@@ -1,6 +1,5 @@
 /* Previous grapheme cluster function.
    Copyright (C) 2010-2025 Free Software Foundation, Inc.
-   Written by Ben Pfaff <b...@cs.stanford.edu>, 2010.
 
    This file is free software.
    It is dual-licensed under "the GNU LGPLv3+ or the GNU GPLv2+".
@@ -23,6 +22,8 @@
    License and of the GNU General Public License along with this
    program.  If not, see <https://www.gnu.org/licenses/>.  */
 
+/* Written by Bruno Haible <br...@clisp.org>, 2025.  */
+
 /* Don't use the const-improved function macros in this compilation unit.  */
 #define _LIBUNISTRING_NO_CONST_GENERICS
 
@@ -31,35 +32,10 @@
 /* Specification.  */
 #include "unigbrk.h"
 
+#include "unictype.h"
 #include "unistr.h"
 
-const uint8_t *
-u8_grapheme_prev (const uint8_t *s, const uint8_t *start)
-{
-  ucs4_t next;
-
-  if (s == start)
-    return NULL;
-
-  s = u8_prev (&next, s, start);
-  while (s != start)
-    {
-      const uint8_t *prev_s;
-      ucs4_t prev;
-
-      prev_s = u8_prev (&prev, s, start);
-      if (prev_s == NULL)
-        {
-          /* Ill-formed UTF-8 encoding. */
-          return start;
-        }
-
-      if (uc_is_grapheme_break (prev, next))
-        break;
-
-      s = prev_s;
-      next = prev;
-    }
-
-  return s;
-}
+#define FUNC u8_grapheme_prev
+#define UNIT uint8_t
+#define U_PREV u8_prev
+#include "u-grapheme-prev.h"
diff --git a/modules/unigbrk/u16-grapheme-prev b/modules/unigbrk/u16-grapheme-prev
index d9393efabf..1135de7230 100644
--- a/modules/unigbrk/u16-grapheme-prev
+++ b/modules/unigbrk/u16-grapheme-prev
@@ -3,15 +3,20 @@ Find start of previous grapheme cluster in UTF-16 string.
 
 Files:
 lib/unigbrk/u16-grapheme-prev.c
+lib/unigbrk/u-grapheme-prev.h
 tests/macros.h
 
 Depends-on:
-unigbrk/uc-is-grapheme-break
+unigbrk/base
+unigbrk/uc-gbrk-prop
+unictype/incb-of
+unictype/property-extended-pictographic
 unistr/u16-prev
+bool
 
 configure.ac:
 gl_MODULE_INDICATOR([unigbrk/u16-grapheme-prev])
-gl_LIBUNISTRING_MODULE([1.3], [unigbrk/u16-grapheme-prev])
+gl_LIBUNISTRING_MODULE([1.4], [unigbrk/u16-grapheme-prev])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIGBRK_U16_GRAPHEME_PREV
diff --git a/modules/unigbrk/u32-grapheme-prev b/modules/unigbrk/u32-grapheme-prev
index 4997a508eb..6d3223a813 100644
--- a/modules/unigbrk/u32-grapheme-prev
+++ b/modules/unigbrk/u32-grapheme-prev
@@ -3,15 +3,20 @@ Find start of previous grapheme cluster in UTF-32 string.
 
 Files:
 lib/unigbrk/u32-grapheme-prev.c
+lib/unigbrk/u-grapheme-prev.h
 tests/macros.h
 
 Depends-on:
-unigbrk/uc-is-grapheme-break
+unigbrk/base
+unigbrk/uc-gbrk-prop
+unictype/incb-of
+unictype/property-extended-pictographic
 unistr/u32-prev
+bool
 
 configure.ac:
 gl_MODULE_INDICATOR([unigbrk/u32-grapheme-prev])
-gl_LIBUNISTRING_MODULE([1.3], [unigbrk/u32-grapheme-prev])
+gl_LIBUNISTRING_MODULE([1.4], [unigbrk/u32-grapheme-prev])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIGBRK_U32_GRAPHEME_PREV
diff --git a/modules/unigbrk/u8-grapheme-prev b/modules/unigbrk/u8-grapheme-prev
index 29c9501ab9..1ed0335c0c 100644
--- a/modules/unigbrk/u8-grapheme-prev
+++ b/modules/unigbrk/u8-grapheme-prev
@@ -3,15 +3,20 @@ Find start of previous grapheme cluster in UTF-8 string.
 
 Files:
 lib/unigbrk/u8-grapheme-prev.c
+lib/unigbrk/u-grapheme-prev.h
 tests/macros.h
 
 Depends-on:
-unigbrk/uc-is-grapheme-break
+unigbrk/base
+unigbrk/uc-gbrk-prop
+unictype/incb-of
+unictype/property-extended-pictographic
 unistr/u8-prev
+bool
 
 configure.ac:
 gl_MODULE_INDICATOR([unigbrk/u8-grapheme-prev])
-gl_LIBUNISTRING_MODULE([1.3], [unigbrk/u8-grapheme-prev])
+gl_LIBUNISTRING_MODULE([1.4], [unigbrk/u8-grapheme-prev])
 
 Makefile.am:
 if LIBUNISTRING_COMPILE_UNIGBRK_U8_GRAPHEME_PREV
diff --git a/tests/unigbrk/test-u16-grapheme-prev.c b/tests/unigbrk/test-u16-grapheme-prev.c
index 60d1ec9e63..4baec3b0cc 100644
--- a/tests/unigbrk/test-u16-grapheme-prev.c
+++ b/tests/unigbrk/test-u16-grapheme-prev.c
@@ -97,6 +97,15 @@ main (void)
   test_u16_grapheme_prev (1, 'e', ACUTE, 'x', -1);
   test_u16_grapheme_prev (2, 'e', ACUTE, 'e', ACUTE, -1);
 
+  /* CR LF handling.  */
+  test_u16_grapheme_prev (2, 'c', '\r', '\n', -1);
+
+  /* Emoji modifier / ZWJ sequence. */
+  test_u16_grapheme_prev (5, 0x2605, 0x0305, 0x0347, 0x200D, 0x2600, -1);
+
+  /* Regional indicators.  */
+  test_u16_grapheme_prev (4, 0xD83C, 0xDDE9, 0xD83C, 0xDDEA, 0xD83C, 0xDDEB, 0xD83C, 0xDDF7, -1);
+
   /* Surrogate pairs.  */
   test_u16_grapheme_prev (2, 0xd83d, 0xde10, -1); /* 😐: neutral face.  */
   test_u16_grapheme_prev (3, 0xd83d, 0xde10, GRAVE, -1);
diff --git a/tests/unigbrk/test-u32-grapheme-prev.c b/tests/unigbrk/test-u32-grapheme-prev.c
index 8420fa4968..ae855bf337 100644
--- a/tests/unigbrk/test-u32-grapheme-prev.c
+++ b/tests/unigbrk/test-u32-grapheme-prev.c
@@ -97,6 +97,15 @@ main (void)
   test_u32_grapheme_prev (1, 'e', ACUTE, 'x', -1);
   test_u32_grapheme_prev (2, 'e', ACUTE, 'e', ACUTE, -1);
 
+  /* CR LF handling.  */
+  test_u32_grapheme_prev (2, 'c', '\r', '\n', -1);
+
+  /* Emoji modifier / ZWJ sequence. */
+  test_u32_grapheme_prev (5, 0x2605, 0x0305, 0x0347, 0x200D, 0x2600, -1);
+
+  /* Regional indicators.  */
+  test_u32_grapheme_prev (2, 0x1F1E9, 0x1F1EA, 0x1F1EB, 0x1F1F7, -1);
+
   /* Outside BMP.  */
 #define NEUTRAL_FACE 0x1f610    /* 😐: neutral face.  */
   test_u32_grapheme_prev (1, NEUTRAL_FACE, -1);
diff --git a/tests/unigbrk/test-u8-grapheme-prev.c b/tests/unigbrk/test-u8-grapheme-prev.c
index 0a63d4dc3f..6d6ab46ac0 100644
--- a/tests/unigbrk/test-u8-grapheme-prev.c
+++ b/tests/unigbrk/test-u8-grapheme-prev.c
@@ -77,5 +77,16 @@ main (void)
   test_u8_grapheme_prev ("e"ACUTE"x", 4, 1);
   test_u8_grapheme_prev ("e"ACUTE "e"ACUTE, 6, 3);
 
+  /* CR LF handling.  */
+  test_u8_grapheme_prev ("c\r\n", 3, 2);
+
+  /* Emoji modifier / ZWJ sequence. */
+  test_u8_grapheme_prev ("\342\230\205\314\205\315\207\342\200\215\342\230\200",
+                         13, 13);
+
+  /* Regional indicators.  */
+  test_u8_grapheme_prev ("\360\237\207\251\360\237\207\252\360\237\207\253\360\237\207\267",
+                         16, 8);
+
   return test_exit_status;
 }
-- 
2.43.0
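
The parity check behind the GB12/GB13 branch may be easier to follow in isolation. The following is only a sketch, assuming a plain UTF-32 array and a hard-coded Regional_Indicator range (U+1F1E6..U+1F1FF) in place of uc_graphemeclusterbreak_property. The helper names are hypothetical; they are not part of libunistring or of this patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of libunistring: true for the
   Regional_Indicator code points U+1F1E6..U+1F1FF.  */
static bool
is_regional_indicator (uint32_t uc)
{
  return uc >= 0x1F1E6 && uc <= 0x1F1FF;
}

/* Hypothetical helper: number of consecutive RI code points at the
   end of s[0..n).  */
static size_t
trailing_ri_count (const uint32_t *s, size_t n)
{
  size_t count = 0;

  while (n > 0 && is_regional_indicator (s[n - 1]))
    {
      count++;
      n--;
    }
  return count;
}

/* Hypothetical helper: assuming s[i-1] and s[i] are both RI (and
   i >= 1), returns true if GB12/GB13 forbid a break between them.
   That is the case when an even number of RI precede s[i-1], so
   that s[i-1] opens a new pair.  */
static bool
ri_pair_is_joined (const uint32_t *s, size_t i)
{
  return trailing_ri_count (s, i - 1) % 2 == 0;
}
```

With the four regional indicators from the new test case (U+1F1E9 U+1F1EA U+1F1EB U+1F1F7), ri_pair_is_joined reports a join at indices 1 and 3 but not at index 2, which is consistent with the expected result that the last grapheme cluster spans two characters.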