Hi Collin,

> Thanks, I was going to look at the new 'DoNotEmit.txt' file added in
> Unicode 16.0.0 to see if I could write anything useful with it.

https://www.unicode.org/versions/Unicode16.0.0/#Summary says:
  "This data could be used by applications such as input methods or
   autocorrect."

Input methods are out-of-scope of Gnulib, because they are specialized
programs / libraries / plugins.

Regarding autocorrect, I'm not sure they need the DoNotEmit data in a general
form.

So, I wouldn't spend much time on it.

> I noticed a small difference in my output from gen-uni-tables.c, see below:

Oh, that points to undefined behaviour. And indeed, clang+asan+ubsan (my
preferred debugging setting) pinpoints the problem. Fixed through this patch:


2024-09-14  Bruno Haible  <br...@clisp.org>

	unilbrk/tables: Fix table (regression yesterday).
	Reported by Collin Funk <collin.fu...@gmail.com> in
	<https://lists.gnu.org/archive/html/bug-gnulib/2024-09/msg00061.html>.
	* lib/gen-uni-tables.c (output_lbrk_rules_as_tables): Use LBP_AL1 as
	array index instead of LBP_AL. Update comments.
	* lib/unilbrk/lbrktables.c: Regenerated.

diff --git a/lib/gen-uni-tables.c b/lib/gen-uni-tables.c
index c27f870bd1..dac7715f55 100644
--- a/lib/gen-uni-tables.c
+++ b/lib/gen-uni-tables.c
@@ -9480,10 +9480,10 @@ output_lbrk_rules_as_tables (const char *filename, const char *version)
   /* (LB10) Treat any remaining combining mark or ZWJ as AL. */
   /* We resolve LBP_CM at runtime, before accessing the table. */
   for (before = 0; before < NLBP; before++)
-    table[before][LBP_ZWJ] = table[before][LBP_AL];
+    table[before][LBP_ZWJ] = table[before][LBP_AL1];
   for (after = 0; after < NLBP; after++)
-    table[LBP_ZWJ][after] = table[LBP_AL][after];
-  table[LBP_ZWJ][LBP_ZWJ] = table[LBP_AL][LBP_AL];
+    table[LBP_ZWJ][after] = table[LBP_AL1][after];
+  table[LBP_ZWJ][LBP_ZWJ] = table[LBP_AL1][LBP_AL1];
 
   /* (LB8a) Do not break between a zero width joiner and an ideograph, emoji
      base or emoji modifier. */
@@ -9495,12 +9495,25 @@ output_lbrk_rules_as_tables (const char *filename, const char *version)
      (LB30a) Break between two regional indicator symbols if and only if there
      are an even number of regional indicators preceding the position of the
      break.
-     (LB21a) Don't break after Hebrew + Hyphen.
+     (LB28a) Don't break inside orthographic syllables of Brahmic scripts, lines
+     3 and 4.
+     (LB25) Do not break between the following pairs of classes relevant to
+     numbers, lines with NU (SY|IS)* or OP NU or OP IS NU.
+     (LB21a) Don't break after Hebrew + Hyphen/Break-After, before non-Hebrew.
+     (LB20a) Don't break after a word-initial hyphen.
      (LB20) Break before and after unresolved CB.
      We resolve LBP_CB at runtime, before accessing the table.
+     (LB19a) Don't break on either side of ambiguous quotation marks, except next
+     to an EastAsian character.
+     (LB15c) Break before a decimal mark that follows a space.
+     Part of (LB15b) Do not break before an ambiguous quotation that is a final
+     punctuation, even after spaces.
+     Part of (LB15a) Do not break before an ambiguous quotation that is an initial
+     punctuation, even after spaces.
      (LB9) Do not break a combining character sequence; treat it as if it has the
      line breaking class of the base character in all of the following rules.
      Treat ZWJ as if it were CM.
+     Part of (LB8a) Don't break right after a zero-width joiner.
      (LB8) Break before any character following a zero-width space, even if one
      or more spaces intervene.
      We handle LBP_ZW at runtime, before accessing the table.
diff --git a/lib/unilbrk/lbrktables.c b/lib/unilbrk/lbrktables.c
(Regenerated)
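
For readers who want to see the class of bug in isolation, here is a minimal
sketch. It is not code from gen-uni-tables.c. All DEMO_* names and their
numeric values are hypothetical, and it assumes (as the fix suggests, though
the patch does not spell it out) that LBP_AL is not a valid index into the
NLBP x NLBP table while LBP_AL1 is. The copy loops have the same shape as the
LB10 step in the patch, with an added assertion that would have caught the
wrong index without needing a sanitizer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified class numbering; NOT the real gnulib values.
   Classes below DEMO_NLBP are valid table indices.  Classes at or above
   it are assumed to be resolved at run time and must never index the
   table.  */
enum
{
  DEMO_LBP_ZWJ = 0,
  DEMO_LBP_AL1 = 1,   /* assumed in-table variant of AL */
  DEMO_LBP_ID  = 2,
  DEMO_NLBP    = 3,   /* number of table rows and columns */
  DEMO_LBP_AL  = 100  /* assumed generic AL: out of range as an index */
};

static bool demo_table[DEMO_NLBP][DEMO_NLBP];

/* Returns true if CLS may be used as a row or column index.  */
static bool
demo_is_table_index (int cls)
{
  return cls >= 0 && cls < DEMO_NLBP;
}

/* Copies the row and column of SRC into the row and column of
   DEMO_LBP_ZWJ, in the same order as the loops in the patch.  */
static void
demo_treat_zwj_as (int src)
{
  /* Passing DEMO_LBP_AL here would trip this assertion instead of
     silently reading outside the array.  */
  assert (demo_is_table_index (src));

  for (int before = 0; before < DEMO_NLBP; before++)
    demo_table[before][DEMO_LBP_ZWJ] = demo_table[before][src];
  for (int after = 0; after < DEMO_NLBP; after++)
    demo_table[DEMO_LBP_ZWJ][after] = demo_table[src][after];
  demo_table[DEMO_LBP_ZWJ][DEMO_LBP_ZWJ] = demo_table[src][src];
}

/* Fills the table with arbitrary demo values, then applies the step.  */
static void
demo_build_table (void)
{
  demo_table[DEMO_LBP_AL1][DEMO_LBP_AL1] = true;
  demo_table[DEMO_LBP_AL1][DEMO_LBP_ID] = true;
  demo_table[DEMO_LBP_ID][DEMO_LBP_AL1] = false;
  demo_table[DEMO_LBP_ID][DEMO_LBP_ID] = true;
  demo_treat_zwj_as (DEMO_LBP_AL1);
}
```

Calling demo_build_table () leaves the ZWJ row and column identical to the AL1
row and column. Changing the argument to DEMO_LBP_AL aborts on the assertion,
which is the situation clang+asan+ubsan flagged in the real generator.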