On 2023-08-04 16:05, Bruno Haible wrote:
To me, the columns timings of mbiterf and mbuiterf are good enough.
Not to me. Perhaps I'm used to apps like grep and diff where we try to
get as much performance as we can (without going off the deep end of
course).
There are tradeoffs here: mbcel wins on simplicity and performance,
whereas mbiter wins on generality. Since the generality gains (namely,
support for encodings that diff doesn't need) are small for diff, there
is space for something like mbcel.
Emacs is a complex beast. I can understand if the Emacs developers want
an implementationally-simple behaviour, rather than a simple-from-the-
user-perspective behaviour.
I don't agree that the MEE approach is necessarily simpler from the
user's perspective. Although it may be simpler for some apps, it's more
complicated for others, and it's not surprising that Emacs, grep, diff,
etc. take the SEE approach. I expect that Gnulib should support SEE for
apps that prefer it. I'll try to squeeze free some time to think about
how to do that.
For MEE, mbiterf would need something like the attached untested patch,
and mbiter, mbcel, etc. would all need similar patches.
Good point.
The attached patch implements that. Look good to you?
(Although maybe you may want to align the module name to be similar
to mbiterf and mbuiterf : maybe mbitervf and mbuitervf for "very fast"?)
I'll think about naming. I hope for something a bit easier to
spell/remember than "mbuitervf". (To be honest I'm not sold on the
existence of mviterf and mbuiterf, as they're slower than mbcel even if
mbcel is changed to use MEE.)
More important to my mind is how apps choose between SEE and MEE. In
some sense, the choice between SEE and MEE is orthogonal to the choice
between mbcel and mbiter, as it'd be easy to modify mbcel to optionally
support MEE and also easy to modify mbiter to optionally support SEE.
From 9802e8bde49985f5fd8824fd8d03a354096092fc Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Sun, 6 Aug 2023 22:45:51 -0700
Subject: [PATCH] mbiter: return encoding-error prefix lengths
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When mbrtoc32 returns -1, return the length of the smallest
invalid prefix of the input, instead of always returning length 1.
This makes it more convenient for a caller that translates
characters to process input items according to the WHATWG Encoding
Living Standard (2023-06-19) section 4.1 "Encoders and Decoders"
<https://encoding.spec.whatwg.org/#encoders-and-decoders>.
* lib/mbiter.h (mbiter_multi_next):
* lib/mbiterf.h (mbiterf_next):
* lib/mbuiter.h (mbuiter_multi_next):
* lib/mbuiterf.h (mbuiterf_next):
When mbrtoc32 returns (size_t) -1, don’t simply yield length 1.
Instead, return the length of the smallest prefix of the input
for which mbrtoc32 returns (size_t) -1 instead of (size_t) -2.
---
ChangeLog | 17 +++++++++++++++++
lib/mbiter.h | 22 +++++++++++++++++++---
lib/mbiterf.h | 17 +++++++++++++++--
lib/mbuiter.h | 24 +++++++++++++++++++-----
lib/mbuiterf.h | 20 +++++++++++++++++---
5 files changed, 87 insertions(+), 13 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index 80ac7184d8..d39dfac267 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,20 @@
+2023-08-06 Paul Eggert <egg...@cs.ucla.edu>
+
+ mbiter: return encoding-error prefix lengths
+ When mbrtoc32 returns -1, return the length of the smallest
+ invalid prefix of the input, instead of always returning length 1.
+ This makes it more convenient for a caller that translates
+ characters to process input items according to the WHATWG Encoding
+ Living Standard (2023-06-19) section 4.1 "Encoders and Decoders"
+ <https://encoding.spec.whatwg.org/#encoders-and-decoders>.
+ * lib/mbiter.h (mbiter_multi_next):
+ * lib/mbiterf.h (mbiterf_next):
+ * lib/mbuiter.h (mbuiter_multi_next):
+ * lib/mbuiterf.h (mbuiterf_next):
+ When mbrtoc32 returns (size_t) -1, don’t simply yield length 1.
+ Instead, return the length of the smallest prefix of the input
+ for which mbrtoc32 returns (size_t) -1 instead of (size_t) -2.
+
2023-08-05 Paul Eggert <egg...@cs.ucla.edu>
readutmp: anticipate Y2038 hack for utmp
diff --git a/lib/mbiter.h b/lib/mbiter.h
index b9222fcc3a..33517a2b3e 100644
--- a/lib/mbiter.h
+++ b/lib/mbiter.h
@@ -150,14 +150,30 @@ mbiter_multi_next (struct mbiter_multi *iter)
assert (mbsinit (&iter->state));
#if !GNULIB_MBRTOC32_REGULAR
iter->in_shift = true;
- with_shift:
+ with_shift:;
+ mbstate_t prev_state = iter->state;
#endif
iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr,
iter->limit - iter->cur.ptr, &iter->state);
if (iter->cur.bytes == (size_t) -1)
{
- /* An invalid multibyte sequence was encountered. */
- iter->cur.bytes = 1;
+ /* An invalid multibyte sequence was encountered.
+ Find the length of the smallest invalid prefix of the input,
+ so that the caller sees it as a single encoding error. */
+ for (iter->cur.bytes = 1;
+ iter->cur.bytes < iter->limit - iter->cur.ptr;
+ iter->cur.bytes++)
+ {
+ #if GNULIB_MBRTOC32_REGULAR
+ mbszero (&iter->state);
+ #else
+ iter->state = prev_state;
+ #endif
+ if (mbrtoc32 (&iter->cur.wc, iter->cur.ptr, iter->cur.bytes,
+ &iter->state)
+ == (size_t) -1)
+ break;
+ }
iter->cur.wc_valid = false;
/* Allow the next invocation to continue from a sane state. */
#if !GNULIB_MBRTOC32_REGULAR
diff --git a/lib/mbiterf.h b/lib/mbiterf.h
index dea6aaef58..538a61fa4c 100644
--- a/lib/mbiterf.h
+++ b/lib/mbiterf.h
@@ -129,19 +129,32 @@ mbiterf_next (struct mbif_state *ps, const char *iter, const char *endptr)
#if !GNULIB_MBRTOC32_REGULAR
ps->in_shift = true;
with_shift:;
+ mbstate_t prev_state = ps->state;
#endif
size_t bytes;
char32_t wc;
bytes = mbrtoc32 (&wc, iter, endptr - iter, &ps->state);
if (bytes == (size_t) -1)
{
- /* An invalid multibyte sequence was encountered. */
+ /* An invalid multibyte sequence was encountered.
+ Find the length of the smallest invalid prefix of the input,
+ so that the caller sees it as a single encoding error. */
+ for (bytes = 1; bytes < endptr - iter; bytes++)
+ {
+ #if GNULIB_MBRTOC32_REGULAR
+ mbszero (&ps->state);
+ #else
+ ps->state = prev_state;
+ #endif
+ if (mbrtoc32 (&wc, iter, bytes, &ps->state) == (size_t) -1)
+ break;
+ }
/* Allow the next invocation to continue from a sane state. */
#if !GNULIB_MBRTOC32_REGULAR
ps->in_shift = false;
#endif
mbszero (&ps->state);
- return (mbchar_t) { .ptr = iter, .bytes = 1, .wc_valid = false };
+ return (mbchar_t) { .ptr = iter, .bytes = bytes, .wc_valid = false };
}
else if (bytes == (size_t) -2)
{
diff --git a/lib/mbuiter.h b/lib/mbuiter.h
index 862efa3dbe..a432a94e9c 100644
--- a/lib/mbuiter.h
+++ b/lib/mbuiter.h
@@ -159,15 +159,29 @@ mbuiter_multi_next (struct mbuiter_multi *iter)
assert (mbsinit (&iter->state));
#if !GNULIB_MBRTOC32_REGULAR
iter->in_shift = true;
- with_shift:
+ with_shift:;
+ mbstate_t prev_state = iter->state;
#endif
- iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr,
- strnlen1 (iter->cur.ptr, iter->cur_max),
+ size_t len1 = strnlen1 (iter->cur.ptr, iter->cur_max);
+ iter->cur.bytes = mbrtoc32 (&iter->cur.wc, iter->cur.ptr, len1,
&iter->state);
if (iter->cur.bytes == (size_t) -1)
{
- /* An invalid multibyte sequence was encountered. */
- iter->cur.bytes = 1;
+ /* An invalid multibyte sequence was encountered.
+ Find the length of the smallest invalid prefix of the input,
+ so that the caller sees it as a single encoding error. */
+ for (iter->cur.bytes = 1; iter->cur.bytes < len1; iter->cur.bytes++)
+ {
+ #if GNULIB_MBRTOC32_REGULAR
+ mbszero (&iter->state);
+ #else
+ iter->state = prev_state;
+ #endif
+ if (mbrtoc32 (&iter->cur.wc, iter->cur.ptr, iter->cur.bytes,
+ &iter->state)
+ == (size_t) -1)
+ break;
+ }
iter->cur.wc_valid = false;
/* Allow the next invocation to continue from a sane state. */
#if !GNULIB_MBRTOC32_REGULAR
diff --git a/lib/mbuiterf.h b/lib/mbuiterf.h
index 85c53e73ac..fb6b645786 100644
--- a/lib/mbuiterf.h
+++ b/lib/mbuiterf.h
@@ -139,19 +139,33 @@ mbuiterf_next (struct mbuif_state *ps, const char *iter)
#if !GNULIB_MBRTOC32_REGULAR
ps->in_shift = true;
with_shift:;
+ mbstate_t prev_state = ps->state;
#endif
size_t bytes;
char32_t wc;
- bytes = mbrtoc32 (&wc, iter, strnlen1 (iter, ps->cur_max), &ps->state);
+ size_t len1 = strnlen1 (iter, ps->cur_max);
+ bytes = mbrtoc32 (&wc, iter, len1, &ps->state);
if (bytes == (size_t) -1)
{
- /* An invalid multibyte sequence was encountered. */
+ /* An invalid multibyte sequence was encountered.
+ Find the length of the smallest invalid prefix of the input,
+ so that the caller sees it as a single encoding error. */
+ for (bytes = 1; bytes < len1; bytes++)
+ {
+ #if GNULIB_MBRTOC32_REGULAR
+ mbszero (&ps->state);
+ #else
+ ps->state = prev_state;
+ #endif
+ if (mbrtoc32 (&wc, iter, bytes, &ps->state) == (size_t) -1)
+ break;
+ }
/* Allow the next invocation to continue from a sane state. */
#if !GNULIB_MBRTOC32_REGULAR
ps->in_shift = false;
#endif
mbszero (&ps->state);
- return (mbchar_t) { .ptr = iter, .bytes = 1, .wc_valid = false };
+ return (mbchar_t) { .ptr = iter, .bytes = bytes, .wc_valid = false };
}
else if (bytes == (size_t) -2)
{
--
2.39.2