Re: uninorm/composition: Make more maintainable

Bruno Haible Sat, 14 Sep 2024 03:04:41 -0700

Hi Collin,

> Thanks, I was going to look at the new 'DoNotEmit.txt' file added in
> Unicode 16.0.0 to see if I could write anything useful with it.


https://www.unicode.org/versions/Unicode16.0.0/#Summary says:
  "This data could be used by applications such as input methods or
   autocorrect."
Input methods are out of scope for Gnulib, because they are specialized
programs / libraries / plugins. Regarding autocorrect, I'm not sure such
tools need the DoNotEmit data in a general form.

So, I wouldn't spend much time on it.

> I noticed a small difference in my output from gen-uni-tables.c, see below:

Oh, that points to undefined behaviour. And indeed, clang+asan+ubsan
(my preferred debugging setting) pinpoints the problem. Fixed through this
patch:


2024-09-14  Bruno Haible  <br...@clisp.org>

        unilbrk/tables: Fix table (regression yesterday).
        Reported by Collin Funk <collin.fu...@gmail.com> in
        <https://lists.gnu.org/archive/html/bug-gnulib/2024-09/msg00061.html>.
        * lib/gen-uni-tables.c (output_lbrk_rules_as_tables): Use LBP_AL1 as
        array index instead of LBP_AL. Update comments.
        * lib/unilbrk/lbrktables.c: Regenerated.

diff --git a/lib/gen-uni-tables.c b/lib/gen-uni-tables.c
index c27f870bd1..dac7715f55 100644
--- a/lib/gen-uni-tables.c
+++ b/lib/gen-uni-tables.c
@@ -9480,10 +9480,10 @@ output_lbrk_rules_as_tables (const char *filename, const char *version)
   /* (LB10) Treat any remaining combining mark or ZWJ as AL.  */
   /* We resolve LBP_CM at runtime, before accessing the table.  */
   for (before = 0; before < NLBP; before++)
-    table[before][LBP_ZWJ] = table[before][LBP_AL];
+    table[before][LBP_ZWJ] = table[before][LBP_AL1];
   for (after = 0; after < NLBP; after++)
-    table[LBP_ZWJ][after] = table[LBP_AL][after];
-  table[LBP_ZWJ][LBP_ZWJ] = table[LBP_AL][LBP_AL];
+    table[LBP_ZWJ][after] = table[LBP_AL1][after];
+  table[LBP_ZWJ][LBP_ZWJ] = table[LBP_AL1][LBP_AL1];
 
   /* (LB8a) Do not break between a zero width joiner and an ideograph, emoji
      base or emoji modifier.  */
@@ -9495,12 +9495,25 @@ output_lbrk_rules_as_tables (const char *filename, const char *version)
   (LB30a) Break between two regional indicator symbols if and only if there are
           an even number of regional indicators preceding the position of the
           break.
-  (LB21a) Don't break after Hebrew + Hyphen.
+  (LB28a) Don't break inside orthographic syllables of Brahmic scripts, lines
+          3 and 4.
+  (LB25) Do not break between the following pairs of classes relevant to
+         numbers, lines with NU (SY|IS)* or OP NU or OP IS NU.
+  (LB21a) Don't break after Hebrew + Hyphen/Break-After, before non-Hebrew.
+  (LB20a) Don't break after a word-initial hyphen.
   (LB20) Break before and after unresolved CB.
          We resolve LBP_CB at runtime, before accessing the table.
+  (LB19a) Don't break on either side of ambiguous quotation marks, except next
+          to an EastAsian character.
+  (LB15c) Break before a decimal mark that follows a space.
+  Part of (LB15b) Do not break before an ambiguous quotation that is a final
+                  punctuation, even after spaces.
+  Part of (LB15a) Do not break before an ambiguous quotation that is an initial
+                  punctuation, even after spaces.
   (LB9) Do not break a combining character sequence; treat it as if it has the
         line breaking class of the base character in all of the following rules.
         Treat ZWJ as if it were CM.
+  Part of (LB8a) Don't break right after a zero-width joiner.
   (LB8) Break before any character following a zero-width space, even if one
         or more spaces intervene.
         We handle LBP_ZW at runtime, before accessing the table.
diff --git a/lib/unilbrk/lbrktables.c b/lib/unilbrk/lbrktables.c
(Regenerated)
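
For readers who want to see the class of bug in isolation, here is a minimal
sketch of the (LB10) copy step. It rests on an assumption not spelled out in
the patch itself: that the undefined behaviour came from a class value which is
not a valid row/column index of the table. All X_* names and their numeric
values are hypothetical and are not Gnulib's actual definitions.

```c
#include <assert.h>

/* Hypothetical class numbering for illustration only; these are NOT the
   values used by gen-uni-tables.c.  */
enum
{
  X_AL1,      /* table index for ordinary alphabetic characters */
  X_ZWJ,      /* table index for the zero width joiner */
  X_ID,       /* table index for ideographs */
  X_NLBP,     /* number of rows and columns of the table */
  X_AL = 40   /* assumed: a class value that is not a table index */
};

static unsigned char table[X_NLBP][X_NLBP];

/* Return 1 if CLS may be used as a row or column index of the table.  */
static int
is_table_index (int cls)
{
  return cls >= 0 && cls < X_NLBP;
}

/* Make X_ZWJ behave like the class with table index AL, mirroring the
   three assignments in the patch.  Passing an out-of-range value such as
   X_AL would read outside the array, which is undefined behaviour; the
   assert turns that into a deterministic failure.  Building with
   "clang -fsanitize=address,undefined" reports such accesses even
   without the assert.  */
static void
treat_zwj_as (int al)
{
  int before, after;

  assert (is_table_index (al));
  for (before = 0; before < X_NLBP; before++)
    table[before][X_ZWJ] = table[before][al];
  for (after = 0; after < X_NLBP; after++)
    table[X_ZWJ][after] = table[al][after];
  table[X_ZWJ][X_ZWJ] = table[al][al];
}
```

Calling treat_zwj_as (X_AL1) copies the AL1 row and column into the ZWJ row
and column; calling it with X_AL trips the assertion in this sketch instead of
silently reading past the end of the array.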
