On 2023-07-06 11:34, Bruno Haible wrote:

Indeed, this is the solution that makes no assumptions. Find a patch that
does it.

Thanks, I think. I installed that, and in reviewing it found a minor glitch or two and some opportunities for simplification. I installed the attached further patch which I hope fixes glitches without breaking anything else.


I had expected that the replacement mbrtowc -> mbrtoc32 would be purely
mechanical; I'm surprised that it requires application-specific considerations
here.

Yes, the (size_t) -3 change can be quite a pain.

In my review I found another only-theoretical problem involving locking-shift sequences. The patch adds a brief comment about one of them. However, I'm not planning to lose any sleep over this, as you and I have already spent more time on theoretical platforms than they're likely worth.

One other thing I discovered in my review. POSIX says that 'diff' need not support locking-shift sequences[1], and this business of mbrtoc23 returning (size_t) -3 is in a murky area as it would appear to fall into the locking-shift sequence category (at any rate, it doesn't appear to be a single-shift encoding which is POSIX's only other option for state-dependent encodings). Or maybe the next version of POSIX will have to change in this area? Either way, I wouldn't be surprised if supporting (size_t) -3 would not be a POSIX conformance issue for GNU 'diff' even on theoretical platforms where (size_t) -3 is possible.

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_02
From 2055c9cffd6afc70a1f24b91cc27e7e73982bb60 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 6 Jul 2023 17:00:26 -0700
Subject: [PATCH] diff: simplify recent mbrtoc32 improvement

* src/side.c (print_half_line): Simplify.  Don't worry about
initializing mbstate until it's needed.  Avoid int overflow if the
byte sequence represents more than INT_MAX columns.  Avoid need
for separate TP1 local.
---
 src/side.c | 130 ++++++++++++++++++++++-------------------------------
 1 file changed, 54 insertions(+), 76 deletions(-)

diff --git a/src/side.c b/src/side.c
index 840d199..6b7687f 100644
--- a/src/side.c
+++ b/src/side.c
@@ -82,7 +82,6 @@ print_half_line (char const *const *line, intmax_t indent, intmax_t out_bound)
   intmax_t out_position = 0;
   char const *text_pointer = line[0];
   char const *text_limit = line[1];
-  mbstate_t mbstate = { 0 };
 
   while (text_pointer < text_limit)
     {
@@ -141,83 +140,62 @@ print_half_line (char const *const *line, intmax_t indent, intmax_t out_bound)
           break;
 
         default:
-          /* Invariant: mbstate is in the initial state here.  */
           {
-            char const *tp1 = tp0;
-            /* Sum of the widths of the 32-bit wide characters between
-               TP0 and TP1.  */
-            int width_sum = 0;
-
-            for (;;)
-              {
-                /* Invariant: The multibyte sequences between TP0 and TP1
-                   have already been parsed, but not yet been output to OUT
-                   (because whether we can output it or not depends on the
-                   total width, and we can only determine the total width
-                   once we have arrived at a point where MBSTATE is in the
-                   initial state).  */
-                char32_t wc;
-                size_t bytes = mbrtoc32 (&wc, tp1, text_limit - tp1, &mbstate);
-
-                if (bytes < (size_t) -2)
-                  {
-                    /* mbrtoc32 has produced a (or another) 32-bit wide
-                       character.  Add its width to WIDTH_SUM.  */
-                    int width = c32width (wc);
-                    if (width > 0)
-                      if (ckd_add (&width_sum, width_sum, width))
-                        return out_position;
-                    /* Advance TP1.  */
-                    if (bytes != (size_t) -3)
-                      tp1 += (bytes == 0 ? 1 : bytes);
-                  }
-
-                if (bytes >= (size_t) -2 || mbsinit (&mbstate))
-                  {
-                    /* Now see whether we have room for WIDTH_SUM columns,
-                       and if so, output the multibyte sequences between
-                       TP0 and TP1.  */
-                    if (tp0 < tp1)
-                      {
-                        if (ckd_add (&in_position, in_position, width_sum))
-                          return out_position;
-                        if (in_position <= out_bound)
-                          {
-                            out_position = in_position;
-                            fwrite (tp0, 1, tp1 - tp0, out);
-                          }
-                      }
-                    tp0 = tp1;
-
-                    if (bytes >= (size_t) -2)
-                      {
-                        /* An encoding error (bytes == (size_t) -1), as
-                           (size_t) -2 cannot happen as the buffer ends
-                           in '\n'.  */
-                        if (tp0 < text_limit)
-                          {
-                            /* Consume one byte.  Assume it has
-                               print width 1.  */
-                            if (ckd_add (&in_position, in_position, 1))
-                              return out_position;
-                            if (in_position <= out_bound)
-                              {
-                                out_position = in_position;
-                                putc (*tp0, out);
-                              }
-                            tp0++;
-                          }
-                        memset (&mbstate, '\0', sizeof mbstate);
-                      }
-
-                    /* We are done with this multibyte sequence.
-                       Return to the simpler processing in the outer loop.  */
-                    break;
-                  }
-              }
+	    /* A byte that might start a multibyte character.
+	       Increase TEXT_POINTER, counting columns, until MBSTATE
+	       becomes the initial state again.
+
+	       In practice this code is overkill: on realistic platforms
+	       mbrtoc32 never returns (size_t) -3 and always results in
+	       the initial state unless it returns (size_t) -1.
+	       Although handling (size_t) -3 and non initial states
+	       doesn't hurt complexity significantly, do not handle
+	       other theoretical cases that POSIX allows (such as
+	       mbrtoc32 setting wc = '\r') as it's too painful.  */
+
+	    mbstate_t mbstate = { 0 };
+	    text_pointer = tp0;
+
+	    do
+	      {
+		/* The special value mbrtoc23 returns when it sets WC
+		   to a character but consumes no bytes.  This can
+		   happen only in theory.  Return values greater than
+		   this denote encoding errors.  */
+		size_t SUBSEQUENT = (size_t) -3;
+
+		/* Scan one character or encoding error.
+		   BYTES != (size_t) -2 because TEXT_LIMIT[-1] == '\n'.  */
+		char32_t wc;
+		size_t bytes = mbrtoc32 (&wc, text_pointer,
+					 text_limit - text_pointer, &mbstate);
+		if (bytes != SUBSEQUENT)
+		  text_pointer += 0 < bytes && bytes < SUBSEQUENT ? bytes : 1;
+
+		/* Assume encoding errors have print width 1.  */
+		int width = bytes <= SUBSEQUENT ? c32width (wc) : 1;
+		if (0 < width && ckd_add (&in_position, in_position, width))
+		  return out_position;
+
+		/* If past the trailing newline, disgorge it and stop scan.  */
+		if (text_pointer == text_limit)
+		  {
+		    text_pointer--;
+		    break;
+		  }
+
+		if (SUBSEQUENT < bytes)
+		  break;
+	      }
+	    while (! mbsinit (&mbstate));
+
+	    /* If there is room, output the bytes since TP0.  */
+	    if (in_position <= out_bound)
+	      {
+		out_position = in_position;
+		fwrite (tp0, 1, text_pointer - tp0, out);
+	      }
           }
-          /* Invariant: mbstate is in the initial state here again.  */
-          text_pointer = tp0;
           break;
 
         /* Print width 1.  */
-- 
2.39.2

Reply via email to