Re: From wchar_t to char32_t

Paul Eggert Thu, 06 Jul 2023 17:36:49 -0700

On 2023-07-06 11:34, Bruno Haible wrote:

Indeed, this is the solution that makes no assumptions. Find a patch that
does it.

Thanks, I think. I installed that, and in reviewing it found a minorglitch or two and some opportunities for simplification. I installed theattached further patch which I hope fixes glitches without breakinganything else.

I had expected that the replacement mbrtowc -> mbrtoc32 would be purely
mechanical; I'm surprised that it requires application-specific considerations
here.


Yes, the (size_t) -3 change can be quite a pain.

In my review I found another only-theoretical problem involvinglocking-shift sequences. The patch adds a brief comment about one ofthem. However, I'm not planning to lose any sleep over this, as you andI have already spent more time on theoretical platforms than they'relikely worth.

One other thing I discovered in my review. POSIX says that 'diff' neednot support locking-shift sequences[1], and this business of mbrtoc23returning (size_t) -3 is in a murky area as it would appear to fall intothe locking-shift sequence category (at any rate, it doesn't appear tobe a single-shift encoding which is POSIX's only other option forstate-dependent encodings). Or maybe the next version of POSIX will haveto change in this area? Either way, I wouldn't be surprised ifsupporting (size_t) -3 would not be a POSIX conformance issue for GNU'diff' even on theoretical platforms where (size_t) -3 is possible.

[1]:https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_02

From 2055c9cffd6afc70a1f24b91cc27e7e73982bb60 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 6 Jul 2023 17:00:26 -0700
Subject: [PATCH] diff: simplify recent mbrtoc32 improvement

* src/side.c (print_half_line): Simplify.  Don't worry about
initializing mbstate until it's needed.  Avoid int overflow if the
byte sequence represents more than INT_MAX columns.  Avoid need
for separate TP1 local.
---
 src/side.c | 130 ++++++++++++++++++++++-------------------------------
 1 file changed, 54 insertions(+), 76 deletions(-)

diff --git a/src/side.c b/src/side.c
index 840d199..6b7687f 100644
--- a/src/side.c
+++ b/src/side.c
@@ -82,7 +82,6 @@ print_half_line (char const *const *line, intmax_t indent, intmax_t out_bound)
   intmax_t out_position = 0;
   char const *text_pointer = line[0];
   char const *text_limit = line[1];
-  mbstate_t mbstate = { 0 };
 
   while (text_pointer < text_limit)
     {
@@ -141,83 +140,62 @@ print_half_line (char const *const *line, intmax_t indent, intmax_t out_bound)
           break;
 
         default:
-          /* Invariant: mbstate is in the initial state here.  */
           {
-            char const *tp1 = tp0;
-            /* Sum of the widths of the 32-bit wide characters between
-               TP0 and TP1.  */
-            int width_sum = 0;
-
-            for (;;)
-              {
-                /* Invariant: The multibyte sequences between TP0 and TP1
-                   have already been parsed, but not yet been output to OUT
-                   (because whether we can output it or not depends on the
-                   total width, and we can only determine the total width
-                   once we have arrived at a point where MBSTATE is in the
-                   initial state).  */
-                char32_t wc;
-                size_t bytes = mbrtoc32 (&wc, tp1, text_limit - tp1, &mbstate);
-
-                if (bytes < (size_t) -2)
-                  {
-                    /* mbrtoc32 has produced a (or another) 32-bit wide
-                       character.  Add its width to WIDTH_SUM.  */
-                    int width = c32width (wc);
-                    if (width > 0)
-                      if (ckd_add (&width_sum, width_sum, width))
-                        return out_position;
-                    /* Advance TP1.  */
-                    if (bytes != (size_t) -3)
-                      tp1 += (bytes == 0 ? 1 : bytes);
-                  }
-
-                if (bytes >= (size_t) -2 || mbsinit (&mbstate))
-                  {
-                    /* Now see whether we have room for WIDTH_SUM columns,
-                       and if so, output the multibyte sequences between
-                       TP0 and TP1.  */
-                    if (tp0 < tp1)
-                      {
-                        if (ckd_add (&in_position, in_position, width_sum))
-                          return out_position;
-                        if (in_position <= out_bound)
-                          {
-                            out_position = in_position;
-                            fwrite (tp0, 1, tp1 - tp0, out);
-                          }
-                      }
-                    tp0 = tp1;
-
-                    if (bytes >= (size_t) -2)
-                      {
-                        /* An encoding error (bytes == (size_t) -1), as
-                           (size_t) -2 cannot happen as the buffer ends
-                           in '\n'.  */
-                        if (tp0 < text_limit)
-                          {
-                            /* Consume one byte.  Assume it has
-                               print width 1.  */
-                            if (ckd_add (&in_position, in_position, 1))
-                              return out_position;
-                            if (in_position <= out_bound)
-                              {
-                                out_position = in_position;
-                                putc (*tp0, out);
-                              }
-                            tp0++;
-                          }
-                        memset (&mbstate, '\0', sizeof mbstate);
-                      }
-
-                    /* We are done with this multibyte sequence.
-                       Return to the simpler processing in the outer loop.  */
-                    break;
-                  }
-              }
+	    /* A byte that might start a multibyte character.
+	       Increase TEXT_POINTER, counting columns, until MBSTATE
+	       becomes the initial state again.
+
+	       In practice this code is overkill: on realistic platforms
+	       mbrtoc32 never returns (size_t) -3 and always results in
+	       the initial state unless it returns (size_t) -1.
+	       Although handling (size_t) -3 and non initial states
+	       doesn't hurt complexity significantly, do not handle
+	       other theoretical cases that POSIX allows (such as
+	       mbrtoc32 setting wc = '\r') as it's too painful.  */
+
+	    mbstate_t mbstate = { 0 };
+	    text_pointer = tp0;
+
+	    do
+	      {
+		/* The special value mbrtoc23 returns when it sets WC
+		   to a character but consumes no bytes.  This can
+		   happen only in theory.  Return values greater than
+		   this denote encoding errors.  */
+		size_t SUBSEQUENT = (size_t) -3;
+
+		/* Scan one character or encoding error.
+		   BYTES != (size_t) -2 because TEXT_LIMIT[-1] == '\n'.  */
+		char32_t wc;
+		size_t bytes = mbrtoc32 (&wc, text_pointer,
+					 text_limit - text_pointer, &mbstate);
+		if (bytes != SUBSEQUENT)
+		  text_pointer += 0 < bytes && bytes < SUBSEQUENT ? bytes : 1;
+
+		/* Assume encoding errors have print width 1.  */
+		int width = bytes <= SUBSEQUENT ? c32width (wc) : 1;
+		if (0 < width && ckd_add (&in_position, in_position, width))
+		  return out_position;
+
+		/* If past the trailing newline, disgorge it and stop scan.  */
+		if (text_pointer == text_limit)
+		  {
+		    text_pointer--;
+		    break;
+		  }
+
+		if (SUBSEQUENT < bytes)
+		  break;
+	      }
+	    while (! mbsinit (&mbstate));
+
+	    /* If there is room, output the bytes since TP0.  */
+	    if (in_position <= out_bound)
+	      {
+		out_position = in_position;
+		fwrite (tp0, 1, text_pointer - tp0, out);
+	      }
           }
-          /* Invariant: mbstate is in the initial state here again.  */
-          text_pointer = tp0;
           break;
 
         /* Print width 1.  */
-- 
2.39.2

Re: From wchar_t to char32_t

Reply via email to