Re: mbcel module for Gnulib?, incomplete multibyte sequences

Paul Eggert Mon, 07 Aug 2023 00:32:34 -0700

On 2023-08-04 16:05, Bruno Haible wrote:

To me, the columns timings of mbiterf and mbuiterf are good enough.

Not to me. Perhaps I'm used to apps like grep and diff where we try to get as much performance as we can (without going off the deep end of course).

There are tradeoffs here: mbcel wins on simplicity and performance, whereas mbiter wins on generality. Since the generality gains (namely, support for encodings that diff doesn't need) are small for diff, there is space for something like mbcel.

Emacs is a complex beast. I can understand if the Emacs developers want
an implementationally-simple behaviour, rather than a simple-from-the-
user-perspective behaviour.

I don't agree that the MEE approach is necessarily simpler from the user's perspective. Although it may be simpler for some apps, it's more complicated for others, and it's not surprising that Emacs, grep, diff, etc. take the SEE approach. I expect that Gnulib should support SEE for apps that prefer it. I'll try to squeeze free some time to think about how to do that.

For MEE, mbiterf would need something like the attached untested patch,
and mbiter, mbcel, etc. would all need similar patches.


Good point.


The attached patch implements that. Look good to you?

(Although maybe you may want to align the module name to be similar
to mbiterf and mbuiterf : maybe mbitervf and mbuitervf for "very fast"?)

I'll think about naming. I hope for something a bit easier to spell/remember than "mbuitervf". (To be honest I'm not sold on the existence of mbiterf and mbuiterf, as they're slower than mbcel even if mbcel is changed to use MEE.)

More important to my mind is how apps choose between SEE and MEE. In some sense, the choice between SEE and MEE is orthogonal to the choice between mbcel and mbiter, as it'd be easy to modify mbcel to optionally support MEE and also easy to modify mbiter to optionally support SEE.
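
To make the orthogonality concrete, here is a minimal sketch, not the Gnulib code. It assumes, from the patch below, that SEE corresponds to the old "always length 1" behaviour and MEE to the "smallest invalid prefix" loop. The names toy_decode, encoding_error_length and enum error_mode are hypothetical; toy_decode is a simplified, stateless UTF-8-only stand-in for mbrtoc32 that mimics only its (size_t) -1 and (size_t) -2 return conventions.

```c
#include <assert.h>
#include <stddef.h>

#define DEC_INVALID ((size_t) -1)    /* like mbrtoc32's (size_t) -1 */
#define DEC_INCOMPLETE ((size_t) -2) /* like mbrtoc32's (size_t) -2 */

/* Hypothetical selector; neither mbcel nor mbiter has such a flag.  */
enum error_mode { ERR_SINGLE_BYTE, ERR_SMALLEST_PREFIX };

/* Hypothetical stand-in for mbrtoc32 in a UTF-8 locale.  Return the
   length of the character at S (at most N bytes), DEC_INCOMPLETE if
   the N bytes are a proper prefix of a character, or DEC_INVALID.  */
static size_t
toy_decode (const unsigned char *s, size_t n)
{
  if (n == 0)
    return DEC_INCOMPLETE;
  unsigned char lo = 0x80, hi = 0xBF; /* allowed range of next byte */
  size_t need;
  if (s[0] < 0x80)
    return 1;
  else if (0xC2 <= s[0] && s[0] <= 0xDF)
    need = 2;
  else if (0xE0 <= s[0] && s[0] <= 0xEF)
    {
      need = 3;
      if (s[0] == 0xE0) lo = 0xA0; /* reject overlong forms */
      if (s[0] == 0xED) hi = 0x9F; /* reject surrogates */
    }
  else if (0xF0 <= s[0] && s[0] <= 0xF4)
    {
      need = 4;
      if (s[0] == 0xF0) lo = 0x90; /* reject overlong forms */
      if (s[0] == 0xF4) hi = 0x8F; /* reject values above U+10FFFF */
    }
  else
    return DEC_INVALID;
  for (size_t i = 1; i < need; i++)
    {
      if (i == n)
        return DEC_INCOMPLETE;
      if (s[i] < lo || hi < s[i])
        return DEC_INVALID;
      lo = 0x80, hi = 0xBF; /* later bytes use the plain range */
    }
  return need;
}

/* Given that the N bytes at S decode to DEC_INVALID, return how many
   bytes to report as one encoding error under MODE.  */
static size_t
encoding_error_length (const unsigned char *s, size_t n,
                       enum error_mode mode)
{
  assert (n > 0 && toy_decode (s, n) == DEC_INVALID);
  if (mode == ERR_SINGLE_BYTE)
    return 1;
  /* Same shape as the loop in the patch: grow the prefix until the
     decoder reports "invalid" rather than "incomplete".  */
  size_t len;
  for (len = 1; len < n; len++)
    if (toy_decode (s, len) == DEC_INVALID)
      break;
  return len;
}
```

In this sketch the iterator is untouched apart from the single call that picks the error length, which is why the choice looks independent of mbcel versus mbiter. Note that the prefix found this way includes the byte that made the sequence invalid, because that is the first point at which the decoder stops answering "incomplete".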

From 9802e8bde49985f5fd8824fd8d03a354096092fc Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Sun, 6 Aug 2023 22:45:51 -0700
Subject: [PATCH] mbiter: return encoding-error prefix lengths
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When mbrtoc32 returns -1, return the length of the smallest
invalid prefix of the input, instead of always returning length 1.
This makes it more convenient for a caller that translates
characters to process input items according to the WHATWG Encoding
Living Standard (2023-06-19) section 4.1 "Encoders and Decoders"
<https://encoding.spec.whatwg.org/#encoders-and-decoders>.
* lib/mbiter.h (mbiter_multi_next):
* lib/mbiterf.h (mbiterf_next):
* lib/mbuiter.h (mbuiter_multi_next):
* lib/mbuiterf.h (mbuiterf_next):
When mbrtoc32 returns (size_t) -1, don’t simply yield length 1.
Instead, return the length of the smallest prefix of the input
for which mbrtoc32 returns (size_t) -1 instead of (size_t) -2.
---
 ChangeLog      | 17 +++++++++++++++++
 lib/mbiter.h   | 22 +++++++++++++++++++---
 lib/mbiterf.h  | 17 +++++++++++++++--
 lib/mbuiter.h  | 24 +++++++++++++++++++-----
 lib/mbuiterf.h | 20 +++++++++++++++++---
 5 files changed, 87 insertions(+), 13 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 80ac7184d8..d39dfac267 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,20 @@
+2023-08-06  Paul Eggert  <egg...@cs.ucla.edu>
+
+	mbiter: return encoding-error prefix lengths
+	When mbrtoc32 returns -1, return the length of the smallest
+	invalid prefix of the input, instead of always returning length 1.
+	This makes it more convenient for a caller that translates
+	characters to process input items according to the WHATWG Encoding
+	Living Standard (2023-06-19) section 4.1 "Encoders and Decoders"
+	<https://encoding.spec.whatwg.org/#encoders-and-decoders>.
+	* lib/mbiter.h (mbiter_multi_next):
+	* lib/mbiterf.h (mbiterf_next):
+	* lib/mbuiter.h (mbuiter_multi_next):
+	* lib/mbuiterf.h (mbuiterf_next):
+	When mbrtoc32 returns (size_t) -1, don’t simply yield length 1.
+	Instead, return the length of the smallest prefix of the input
+	for which mbrtoc32 returns (size_t) -1 instead of (size_t) -2.
+
 2023-08-05  Paul Eggert  <egg...@cs.ucla.edu>
 
 	readutmp: anticipate Y2038 hack for utmp
diff --git a/lib/mbiter.h b/lib/mbiter.h
index b9222fcc3a..33517a2b3e 100644
--- a/lib/mbiter.h
+++ b/lib/mbiter.h
@@ -150,14 +150,30 @@ mbiter_multi_next (struct mbiter_multi *iter)
       assert (mbsinit (&iter->state));
       #if !GNULIB_MBRTOC32_REGULAR
       iter->in_shift = true;
-    with_shift:
+    with_shift:;
+      mbstate_t prev_state = iter->state;
       #endif
       iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr,
                                   iter->limit - iter->cur.ptr, &iter->state);
       if (iter->cur.bytes == (size_t) -1)
         {
-          /* An invalid multibyte sequence was encountered.  */
-          iter->cur.bytes = 1;
+          /* An invalid multibyte sequence was encountered.
+             Find the length of the smallest invalid prefix of the input,
+             so that the caller sees it as a single encoding error. */
+          for (iter->cur.bytes = 1;
+               iter->cur.bytes < iter->limit - iter->cur.ptr;
+               iter->cur.bytes++)
+            {
+              #if GNULIB_MBRTOC32_REGULAR
+              mbszero (&iter->state);
+              #else
+              iter->state = prev_state;
+              #endif
+              if (mbrtoc32 (&iter->cur.wc, iter->cur.ptr, iter->cur.bytes,
+                            &iter->state)
+                  == (size_t) -1)
+                break;
+            }
           iter->cur.wc_valid = false;
           /* Allow the next invocation to continue from a sane state.  */
           #if !GNULIB_MBRTOC32_REGULAR
diff --git a/lib/mbiterf.h b/lib/mbiterf.h
index dea6aaef58..538a61fa4c 100644
--- a/lib/mbiterf.h
+++ b/lib/mbiterf.h
@@ -129,19 +129,32 @@ mbiterf_next (struct mbif_state *ps, const char *iter, const char *endptr)
       #if !GNULIB_MBRTOC32_REGULAR
       ps->in_shift = true;
     with_shift:;
+      mbstate_t prev_state = ps->state;
       #endif
       size_t bytes;
       char32_t wc;
       bytes = mbrtoc32 (&wc, iter, endptr - iter, &ps->state);
       if (bytes == (size_t) -1)
         {
-          /* An invalid multibyte sequence was encountered.  */
+          /* An invalid multibyte sequence was encountered.
+             Find the length of the smallest invalid prefix of the input,
+             so that the caller sees it as a single encoding error. */
+          for (bytes = 1; bytes < endptr - iter; bytes++)
+            {
+              #if GNULIB_MBRTOC32_REGULAR
+              mbszero (&ps->state);
+              #else
+              ps->state = prev_state;
+              #endif
+              if (mbrtoc32 (&wc, iter, bytes, &ps->state) == (size_t) -1)
+                break;
+            }
           /* Allow the next invocation to continue from a sane state.  */
           #if !GNULIB_MBRTOC32_REGULAR
           ps->in_shift = false;
           #endif
           mbszero (&ps->state);
-          return (mbchar_t) { .ptr = iter, .bytes = 1, .wc_valid = false };
+          return (mbchar_t) { .ptr = iter, .bytes = bytes, .wc_valid = false };
         }
       else if (bytes == (size_t) -2)
         {
diff --git a/lib/mbuiter.h b/lib/mbuiter.h
index 862efa3dbe..a432a94e9c 100644
--- a/lib/mbuiter.h
+++ b/lib/mbuiter.h
@@ -159,15 +159,29 @@ mbuiter_multi_next (struct mbuiter_multi *iter)
       assert (mbsinit (&iter->state));
       #if !GNULIB_MBRTOC32_REGULAR
       iter->in_shift = true;
-    with_shift:
+    with_shift:;
+      mbstate_t prev_state = iter->state;
       #endif
-      iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr,
-                                  strnlen1 (iter->cur.ptr, iter->cur_max),
+      size_t len1 = strnlen1 (iter->cur.ptr, iter->cur_max);
+      iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr, len1,
                                   &iter->state);
       if (iter->cur.bytes == (size_t) -1)
         {
-          /* An invalid multibyte sequence was encountered.  */
-          iter->cur.bytes = 1;
+          /* An invalid multibyte sequence was encountered.
+             Find the length of the smallest invalid prefix of the input,
+             so that the caller sees it as a single encoding error. */
+          for (iter->cur.bytes = 1; iter->cur.bytes < len1; iter->cur.bytes++)
+            {
+              #if GNULIB_MBRTOC32_REGULAR
+              mbszero (&iter->state);
+              #else
+              iter->state = prev_state;
+              #endif
+              if (mbrtoc32 (&iter->cur.wc, iter->cur.ptr, iter->cur.bytes,
+                            &iter->state)
+                  == (size_t) -1)
+                break;
+            }
           iter->cur.wc_valid = false;
           /* Allow the next invocation to continue from a sane state.  */
           #if !GNULIB_MBRTOC32_REGULAR
diff --git a/lib/mbuiterf.h b/lib/mbuiterf.h
index 85c53e73ac..fb6b645786 100644
--- a/lib/mbuiterf.h
+++ b/lib/mbuiterf.h
@@ -139,19 +139,33 @@ mbuiterf_next (struct mbuif_state *ps, const char *iter)
       #if !GNULIB_MBRTOC32_REGULAR
       ps->in_shift = true;
     with_shift:;
+      mbstate_t prev_state = ps->state;
       #endif
       size_t bytes;
       char32_t wc;
-      bytes = mbrtoc32 (&wc, iter, strnlen1 (iter, ps->cur_max), &ps->state);
+      size_t len1 = strnlen1 (iter, ps->cur_max);
+      bytes = mbrtoc32 (&wc, iter, len1, &ps->state);
       if (bytes == (size_t) -1)
         {
-          /* An invalid multibyte sequence was encountered.  */
+          /* An invalid multibyte sequence was encountered.
+             Find the length of the smallest invalid prefix of the input,
+             so that the caller sees it as a single encoding error. */
+          for (bytes = 1; bytes < len1; bytes++)
+            {
+              #if GNULIB_MBRTOC32_REGULAR
+              mbszero (&ps->state);
+              #else
+              ps->state = prev_state;
+              #endif
+              if (mbrtoc32 (&wc, iter, bytes, &ps->state) == (size_t) -1)
+                break;
+            }
           /* Allow the next invocation to continue from a sane state.  */
           #if !GNULIB_MBRTOC32_REGULAR
           ps->in_shift = false;
           #endif
           mbszero (&ps->state);
-          return (mbchar_t) { .ptr = iter, .bytes = 1, .wc_valid = false };
+          return (mbchar_t) { .ptr = iter, .bytes = bytes, .wc_valid = false };
         }
       else if (bytes == (size_t) -2)
         {
-- 
2.39.2
