On 2023-07-06 11:34, Bruno Haible wrote:
Indeed, this is the solution that makes no assumptions. Find a patch that does it.
Thanks, I think. I installed that, and in reviewing it found a minor glitch or two and some opportunities for simplification. I installed the attached further patch which I hope fixes glitches without breaking anything else.
I had expected that the replacement mbrtowc -> mbrtoc32 would be purely mechanical; I'm surprised that it requires application-specific considerations here.
Yes, the (size_t) -3 change can be quite a pain.In my review I found another only-theoretical problem involving locking-shift sequences. The patch adds a brief comment about one of them. However, I'm not planning to lose any sleep over this, as you and I have already spent more time on theoretical platforms than they're likely worth.
One other thing I discovered in my review. POSIX says that 'diff' need not support locking-shift sequences[1], and this business of mbrtoc23 returning (size_t) -3 is in a murky area as it would appear to fall into the locking-shift sequence category (at any rate, it doesn't appear to be a single-shift encoding which is POSIX's only other option for state-dependent encodings). Or maybe the next version of POSIX will have to change in this area? Either way, I wouldn't be surprised if supporting (size_t) -3 would not be a POSIX conformance issue for GNU 'diff' even on theoretical platforms where (size_t) -3 is possible.
[1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_02
From 2055c9cffd6afc70a1f24b91cc27e7e73982bb60 Mon Sep 17 00:00:00 2001 From: Paul Eggert <egg...@cs.ucla.edu> Date: Thu, 6 Jul 2023 17:00:26 -0700 Subject: [PATCH] diff: simplify recent mbrtoc32 improvement * src/side.c (print_half_line): Simplify. Don't worry about initializing mbstate until it's needed. Avoid int overflow if the byte sequence represents more than INT_MAX columns. Avoid need for separate TP1 local. --- src/side.c | 130 ++++++++++++++++++++++------------------------------- 1 file changed, 54 insertions(+), 76 deletions(-) diff --git a/src/side.c b/src/side.c index 840d199..6b7687f 100644 --- a/src/side.c +++ b/src/side.c @@ -82,7 +82,6 @@ print_half_line (char const *const *line, intmax_t indent, intmax_t out_bound) intmax_t out_position = 0; char const *text_pointer = line[0]; char const *text_limit = line[1]; - mbstate_t mbstate = { 0 }; while (text_pointer < text_limit) { @@ -141,83 +140,62 @@ print_half_line (char const *const *line, intmax_t indent, intmax_t out_bound) break; default: - /* Invariant: mbstate is in the initial state here. */ { - char const *tp1 = tp0; - /* Sum of the widths of the 32-bit wide characters between - TP0 and TP1. */ - int width_sum = 0; - - for (;;) - { - /* Invariant: The multibyte sequences between TP0 and TP1 - have already been parsed, but not yet been output to OUT - (because whether we can output it or not depends on the - total width, and we can only determine the total width - once we have arrived at a point where MBSTATE is in the - initial state). */ - char32_t wc; - size_t bytes = mbrtoc32 (&wc, tp1, text_limit - tp1, &mbstate); - - if (bytes < (size_t) -2) - { - /* mbrtoc32 has produced a (or another) 32-bit wide - character. Add its width to WIDTH_SUM. */ - int width = c32width (wc); - if (width > 0) - if (ckd_add (&width_sum, width_sum, width)) - return out_position; - /* Advance TP1. */ - if (bytes != (size_t) -3) - tp1 += (bytes == 0 ? 1 : bytes); - } - - if (bytes >= (size_t) -2 || mbsinit (&mbstate)) - { - /* Now see whether we have room for WIDTH_SUM columns, - and if so, output the multibyte sequences between - TP0 and TP1. */ - if (tp0 < tp1) - { - if (ckd_add (&in_position, in_position, width_sum)) - return out_position; - if (in_position <= out_bound) - { - out_position = in_position; - fwrite (tp0, 1, tp1 - tp0, out); - } - } - tp0 = tp1; - - if (bytes >= (size_t) -2) - { - /* An encoding error (bytes == (size_t) -1), as - (size_t) -2 cannot happen as the buffer ends - in '\n'. */ - if (tp0 < text_limit) - { - /* Consume one byte. Assume it has - print width 1. */ - if (ckd_add (&in_position, in_position, 1)) - return out_position; - if (in_position <= out_bound) - { - out_position = in_position; - putc (*tp0, out); - } - tp0++; - } - memset (&mbstate, '\0', sizeof mbstate); - } - - /* We are done with this multibyte sequence. - Return to the simpler processing in the outer loop. */ - break; - } - } + /* A byte that might start a multibyte character. + Increase TEXT_POINTER, counting columns, until MBSTATE + becomes the initial state again. + + In practice this code is overkill: on realistic platforms + mbrtoc32 never returns (size_t) -3 and always results in + the initial state unless it returns (size_t) -1. + Although handling (size_t) -3 and non initial states + doesn't hurt complexity significantly, do not handle + other theoretical cases that POSIX allows (such as + mbrtoc32 setting wc = '\r') as it's too painful. */ + + mbstate_t mbstate = { 0 }; + text_pointer = tp0; + + do + { + /* The special value mbrtoc23 returns when it sets WC + to a character but consumes no bytes. This can + happen only in theory. Return values greater than + this denote encoding errors. */ + size_t SUBSEQUENT = (size_t) -3; + + /* Scan one character or encoding error. + BYTES != (size_t) -2 because TEXT_LIMIT[-1] == '\n'. */ + char32_t wc; + size_t bytes = mbrtoc32 (&wc, text_pointer, + text_limit - text_pointer, &mbstate); + if (bytes != SUBSEQUENT) + text_pointer += 0 < bytes && bytes < SUBSEQUENT ? bytes : 1; + + /* Assume encoding errors have print width 1. */ + int width = bytes <= SUBSEQUENT ? c32width (wc) : 1; + if (0 < width && ckd_add (&in_position, in_position, width)) + return out_position; + + /* If past the trailing newline, disgorge it and stop scan. */ + if (text_pointer == text_limit) + { + text_pointer--; + break; + } + + if (SUBSEQUENT < bytes) + break; + } + while (! mbsinit (&mbstate)); + + /* If there is room, output the bytes since TP0. */ + if (in_position <= out_bound) + { + out_position = in_position; + fwrite (tp0, 1, text_pointer - tp0, out); + } } - /* Invariant: mbstate is in the initial state here again. */ - text_pointer = tp0; break; /* Print width 1. */ -- 2.39.2