Re: mbcel module for Gnulib?, incomplete multibyte sequences

Paul Eggert Thu, 07 Sep 2023 15:26:44 -0700

On 2023-08-07 00:32, Paul Eggert wrote:

> I'll think about naming.

It's been a month and I couldn't think of anything better than to shorten the name from mbcel to mcel, so I did that and installed the attached patches. These patches shouldn't affect behavior; that is, they should only add new functionality.

With luck this should be enough for diffutils to stop using its own homegrown variants of Gnulib modules; I plan to look into that next.

I'm not entirely happy with this approach, as it means packages like diffutils will need to pass --avoid=mbuiterf etc. to gnulib-tool if the packages prefer mcel for everything. If gnulib-tool gave us a way to say that mbscasecmp depends on mbuiterf or mcel (i.e., "or" instead of "and") perhaps we could do something better. But the patch should work as-is for diffutils, and if we come up with something better for Gnulib we can improve diffutils accordingly.

Although these patches update Gnulib's 'exclude' and 'mbscasecmp' modules to support mcel-prefer, they don't have similar updates for other modules like 'mbsncasecmp' and 'propername' that could also use such support. Diffutils doesn't use those other modules so I left them alone for now; they can be updated later as needed (and by then maybe we'll have a better solution for the --avoid problem).
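
For anyone who wants a feel for the approach before reading patch 1, here is a rough stand-alone sketch of the idea behind mcel_scan and mcel_cmp. It is not the code in lib/mcel.h: the names cel, cel_scan, cel_cmp and cel_count are made up for illustration, and it calls plain mbrtowc rather than mbrtoc32 so that it builds without Gnulib. It handles ASCII without a library call, treats any byte that does not start a valid character as a one-byte encoding error, and orders encoding errors after all characters.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for mcel_t; not the definition in lib/mcel.h.  */
typedef struct
{
  wchar_t ch;         /* the character, or 0 for an encoding error */
  unsigned char err;  /* 0 for a character, else the offending byte */
  unsigned char len;  /* number of bytes consumed; always positive */
} cel;

/* Scan one character or encoding error from [P, LIM), where P < LIM.  */
static cel
cel_scan (char const *p, char const *lim)
{
  unsigned char c = *p;
  if (c < 0x80)
    return (cel) {.ch = c, .len = 1};   /* ASCII: no library call */

  mbstate_t mbs;
  memset (&mbs, 0, sizeof mbs);
  wchar_t wc;
  size_t n = mbrtowc (&wc, p, lim - p, &mbs);

  /* (size_t) -1 (invalid) and (size_t) -2 (incomplete) both have the
     top bit set; report just the first byte as an encoding error.  */
  if ((size_t) -1 / 2 < n)
    return (cel) {.err = c, .len = 1};

  return (cel) {.ch = wc, .len = n ? n : 1};
}

/* Compare A and B; encoding errors sort after every character because
   0x80 << 14 exceeds the largest Unicode code point, 0x10FFFF.  */
static int
cel_cmp (cel a, cel b)
{
  int ca = a.ch, cb = b.ch;
  return (a.err - b.err) * (1 << 14) + (ca - cb);
}

/* Count the characters and encoding-error bytes in the N bytes at S.  */
static void
cel_count (char const *s, size_t n, size_t *chars, size_t *errs)
{
  *chars = *errs = 0;
  for (char const *p = s, *lim = s + n; p < lim; )
    {
      cel g = cel_scan (p, lim);
      p += g.len;
      if (g.err)
        ++*errs;
      else
        ++*chars;
    }
}
```

A caller would invoke setlocale (LC_ALL, "") first and then pass a buffer and its length to cel_count. How bytes with the high bit set are classified depends on the locale and the C library, so only the ASCII case is predictable everywhere.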

From b93de66735cd6f935ee0970f8cb26908d113e09d Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Sep 2023 14:51:55 -0700
Subject: [PATCH 1/7] mcel: new module

* lib/mcel.c, lib/mcel.h, modules/mcel: New files.
---
 ChangeLog    |   5 +
 lib/mcel.c   |   3 +
 lib/mcel.h   | 294 +++++++++++++++++++++++++++++++++++++++++++++++++++
 modules/mcel |  34 ++++++
 4 files changed, 336 insertions(+)
 create mode 100644 lib/mcel.c
 create mode 100644 lib/mcel.h
 create mode 100644 modules/mcel

diff --git a/ChangeLog b/ChangeLog
index d5fc6c2130..d477347b91 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2023-09-07  Paul Eggert  <egg...@cs.ucla.edu>
+
+	mcel: new module
+	* lib/mcel.c, lib/mcel.h, modules/mcel: New files.
+
 2023-09-07  Bruno Haible  <br...@clisp.org>
 
 	Don't use 'throw ()' in C++ 11 or newer.
diff --git a/lib/mcel.c b/lib/mcel.c
new file mode 100644
index 0000000000..3c2ae46290
--- /dev/null
+++ b/lib/mcel.c
@@ -0,0 +1,3 @@
+#include <config.h>
+#define MCEL_INLINE _GL_EXTERN_INLINE
+#include "mcel.h"
diff --git a/lib/mcel.h b/lib/mcel.h
new file mode 100644
index 0000000000..400604f8b2
--- /dev/null
+++ b/lib/mcel.h
@@ -0,0 +1,294 @@
+/* Multi-byte characters, Error encodings, and Lengths (MCELs)
+   Copyright 2023 Free Software Foundation, Inc.
+
+   This file is free software: you can redistribute it and/or modify
+   it under the terms of the GNU Lesser General Public License as
+   published by the Free Software Foundation; either version 2.1 of the
+   License, or (at your option) any later version.
+
+   This file is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Paul Eggert.  */
+
+/* The macros in this file implement multi-byte character representation
+   and forward iteration through a multi-byte string.
+   They are simpler and can be faster than the mbiter family.
+   However, they do not support obsolescent encodings like CP864,
+   EBCDIC, Johab, and Shift JIS that glibc also does not support,
+   and it is up to the caller to coalesce encoding-error bytes if desired.
+
+   The mcel_scan function lets code iterate through an array of bytes,
+   supporting character encodings in practical use
+   more simply than using plain mbrtoc32.
+
+   Instead of this single-byte code:
+
+      char *p = ..., *lim = ...;
+      for (; p < lim; p++)
+        process (*p);
+
+   You can use this multi-byte code:
+
+      char *p = ..., *lim = ...;
+      for (mcel_t g; p < lim; p += g.len)
+        {
+	  g = mcel_scan (p, lim);
+	  process (g);
+	}
+
+   You can select from G using G.ch, G.err, and G.len.
+   G is an encoding error if G.err is nonzero, a character otherwise.
+
+   The mcel_scanz function is similar except it works with a
+   string of unknown but positive length that is terminated with '\0'.
+   Instead of this single-byte code:
+
+      char *p = ...;
+      for (; *p; p++)
+	process (*p);
+
+   You can use this multi-byte code:
+
+      char *p = ...;
+      for (mcel_t g; *p; p += g.len)
+	{
+	  g = mcel_scanz (p);
+	  process (g);
+	}
+
+   mcel_scant (P, TERMINATOR) is like mcel_scanz (P) except the
+   string is terminated by TERMINATOR.  The C standard says that the
+   TERMINATORs '\0', '\r', '\n', '.', '/' are safe, as they cannot be
+   a part (even a trailing byte) of a multi-byte character.
+   In practice TERMINATOR is safe if 0 <= TERMINATOR <= 0x2f (ASCII '/').
+
+   mcel_ch (CH, LEN) and mcel_err (ERR) construct mcel_t values.
+
+   mcel_cmp (G1, G2) compares two mcel_t values lexicographically by
+   character or by encoding byte value, with encoding bytes sorting
+   after characters.
+
+   Calls like c32isalpha (G.ch) test G; they return false for encoding
+   errors since calls like c32isalpha (0) return false.  Calls like
+   mcel_tocmp (c32tolower, G1, G2) are like mcel_cmp (G1, G2),
+   but transliterate first.
+
+   Although ISO C and POSIX allow encodings that have shift states or
+   that can produce multiple characters from an indivisible byte sequence,
+   POSIX does not require support for these encodings,
+   they are not in practical use on GNUish platforms,
+   and omitting support for them simplifies the API.  */
+
+#ifndef _MCEL_H
+#define _MCEL_H 1
+
+#if !_GL_CONFIG_H_INCLUDED
+ #error "Please include config.h first."
+#endif
+
+#include <verify.h>
+
+#include <limits.h>
+#include <stddef.h>
+#include <uchar.h>
+
+/* Pacify GCC re type limits.  */
+#if defined __GNUC__ && 4 < __GNUC__ + (3 <= __GNUC_MINOR__)
+# pragma GCC diagnostic ignored "-Wtype-limits"
+#endif
+
+/* The maximum multi-byte character length supported on any platform.
+   This can be less than MB_LEN_MAX because many platforms have a
+   large MB_LEN_MAX to allow for stateful encodings, and mcel does not
+   support these encodings.  MCEL_LEN_MAX is enough for UTF-8, EUC,
+   Shift-JIS, GB18030, etc.  In all multi-byte encodings supported by glibc,
+   0 < MB_CUR_MAX <= MCEL_LEN_MAX <= MB_LEN_MAX.  */
+enum { MCEL_LEN_MAX = MB_LEN_MAX < 4 ? MB_LEN_MAX : 4 };
+
+/* Bounds for mcel_t members.  */
+enum { MCEL_CHAR_MAX = 0x10FFFF };
+enum { MCEL_ERR_MIN = 0x80 };
+
+/* mcel_t is a type representing a character CH or an encoding error byte ERR,
+   along with a count of the LEN bytes that represent CH or ERR.
+   If ERR is zero, CH is a valid character and 0 < LEN <= MCEL_LEN_MAX;
+   otherwise ERR is an encoding error byte, MCEL_ERR_MIN <= ERR,
+   CH == 0, and LEN == 1.  */
+typedef struct
+{
+  char32_t ch;
+  unsigned char err;
+  unsigned char len;
+} mcel_t;
+
+/* Every multi-byte character length fits in mcel_t's LEN.  */
+static_assert (MB_LEN_MAX <= UCHAR_MAX);
+
+/* Shifting an encoding error byte left by this value
+   suffices to sort encoding errors after characters.  */
+enum { MCEL_ERR_SHIFT = 14 };
+static_assert (MCEL_CHAR_MAX < MCEL_ERR_MIN << MCEL_ERR_SHIFT);
+
+/* Unsigned char promotes to int.  */
+static_assert (UCHAR_MAX <= INT_MAX);
+
+/* Bytes have 8 bits, as POSIX requires.  */
+static_assert (CHAR_BIT == 8);
+
+#ifndef _GL_LIKELY
+/* Rely on __builtin_expect, as provided by the module 'builtin-expect'.  */
+# define _GL_LIKELY(cond) __builtin_expect ((cond), 1)
+# define _GL_UNLIKELY(cond) __builtin_expect ((cond), 0)
+#endif
+
+_GL_INLINE_HEADER_BEGIN
+#ifndef MCEL_INLINE
+# define MCEL_INLINE _GL_INLINE
+#endif
+
+/* mcel_t constructors.  */
+MCEL_INLINE mcel_t
+mcel_ch (char32_t ch, size_t len)
+{
+  assume (0 < len);
+  assume (len <= MCEL_LEN_MAX);
+  assume (ch <= MCEL_CHAR_MAX);
+  return (mcel_t) {.ch = ch, .len = len};
+}
+MCEL_INLINE mcel_t
+mcel_err (unsigned char err)
+{
+  assume (MCEL_ERR_MIN <= err);
+  return (mcel_t) {.err = err, .len = 1};
+}
+
+/* Compare C1 and C2, with encoding errors sorting after characters.
+   Return <0, 0, >0 for <, =, >.  */
+MCEL_INLINE int
+mcel_cmp (mcel_t c1, mcel_t c2)
+{
+  int ch1 = c1.ch, ch2 = c2.ch;
+  return ((c1.err - c2.err) * (1 << MCEL_ERR_SHIFT)) + (ch1 - ch2);
+}
+
+/* Apply the uchar translator TO to C1 and C2 and compare the results,
+   with encoding errors sorting after characters.
+   Return <0, 0, >0 for <, =, >.  */
+MCEL_INLINE int
+mcel_tocmp (wint_t (*to) (wint_t), mcel_t c1, mcel_t c2)
+{
+  int cmp = mcel_cmp (c1, c2);
+  if (_GL_LIKELY ((c1.err - c2.err) | !cmp))
+    return cmp;
+  int ch1 = to (c1.ch), ch2 = to (c2.ch);
+  return ch1 - ch2;
+}
+
+/* Whether C represents itself as a Unicode character
+   when it is the first byte of a single- or multi-byte character.
+   These days it is safe to assume ASCII, so do not support
+   obsolescent encodings like CP864, EBCDIC, Johab, and Shift JIS.  */
+MCEL_INLINE bool
+mcel_isbasic (char c)
+{
+  return _GL_LIKELY (0 <= c && c < MCEL_ERR_MIN);
+}
+
+/* With mcel there should be no need for the performance overhead of
+   replacing glibc mbrtoc32, as callers shouldn't care whether the
+   C locale treats a byte with the high bit set as an encoding error.  */
+#ifdef __GLIBC__
+# undef mbrtoc32
+#endif
+
+/* Scan bytes from P inclusive to LIM exclusive.  P must be less than LIM.
+   Return the character or encoding error starting at P.  */
+MCEL_INLINE mcel_t
+mcel_scan (char const *p, char const *lim)
+{
+  /* Handle ASCII quickly to avoid the overhead of calling mbrtoc32.
+     In supported encodings, the first byte of a multi-byte character
+     cannot be an ASCII byte.  */
+  char c = *p;
+  if (mcel_isbasic (c))
+    return mcel_ch (c, 1);
+
+  /* An initial mbstate_t; initialization optimized for some platforms.
+     For details about these and other platforms, see wchar.in.h.  */
+#if defined __GLIBC__ && 2 < __GLIBC__ + (2 <= __GLIBC_MINOR__)
+  /* Although only a trivial optimization, it's worth it for GNU.  */
+  mbstate_t mbs; mbs.__count = 0;
+#elif (defined __FreeBSD__ || defined __DragonFly__ || defined __OpenBSD__ \
+       || (defined __APPLE__ && defined __MACH__))
+  /* These platforms have 128-byte mbstate_t.  What were they thinking?
+     Initialize just for supported encodings (UTF-8, EUC, etc.).
+     Avoid memset because some compilers generate function call code.  */
+  struct mbhidden { char32_t ch; int utf8_want, euc_want; }
+    _GL_ATTRIBUTE_MAY_ALIAS;
+  union { mbstate_t m; struct mbhidden s; } u;
+  u.s.ch = u.s.utf8_want = u.s.euc_want = 0;
+# define mbs u.m
+#elif defined __NetBSD__
+  /* Experiments on both 32- and 64-bit NetBSD platforms have
+     shown that it doesn't work to clear fewer than 24 bytes.  */
+  struct mbhidden { long long int a, b, c; } _GL_ATTRIBUTE_MAY_ALIAS;
+  union { mbstate_t m; struct mbhidden s; } u;
+  u.s.a = u.s.b = u.s.c = 0;
+# define mbs u.m
+#else
+  /* mbstate_t has unknown structure or is not worth optimizing.  */
+  mbstate_t mbs = {0};
+#endif
+
+  char32_t ch;
+  size_t len = mbrtoc32 (&ch, p, lim - p, &mbs);
+
+  /* Any LEN with top bit set is an encoding error, as LEN == (size_t) -3
+     is not supported and MB_LEN_MAX is small.  */
+  if (_GL_UNLIKELY ((size_t) -1 / 2 < len))
+    return mcel_err (c);
+
+  /* A multi-byte character.  LEN must be positive,
+     as *P != '\0' and shift sequences are not supported.  */
+  return mcel_ch (ch, len);
+}
+
+/* Scan bytes from P, a byte sequence terminated by TERMINATOR.
+   If *P == TERMINATOR, scan just that byte; otherwise scan
+   bytes up to but not including TERMINATOR.
+   TERMINATOR must be ASCII, and should be '\0', '\r', '\n', '.', or '/'.
+   Return the character or encoding error starting at P.  */
+MCEL_INLINE mcel_t
+mcel_scant (char const *p, char terminator)
+{
+  /* Handle ASCII quickly for speed.  */
+  if (mcel_isbasic (*p))
+    return mcel_ch (*p, 1);
+
+  /* Defer to mcel_scan for non-ASCII.  Compute length with code that
+     is typically faster than strnlen.  */
+  char const *lim = p + 1;
+  for (int i = 0; i < MCEL_LEN_MAX - 1; i++)
+    lim += *lim != terminator;
+  return mcel_scan (p, lim);
+}
+
+/* Scan bytes from P, a byte sequence terminated by '\0'.
+   If *P == '\0', scan just that byte; otherwise scan
+   bytes up to but not including '\0'.
+   Return the character or encoding error starting at P.  */
+MCEL_INLINE mcel_t
+mcel_scanz (char const *p)
+{
+  return mcel_scant (p, '\0');
+}
+
+_GL_INLINE_HEADER_END
+
+#endif /* _MCEL_H */
diff --git a/modules/mcel b/modules/mcel
new file mode 100644
index 0000000000..59ca633641
--- /dev/null
+++ b/modules/mcel
@@ -0,0 +1,34 @@
+Description:
+Multibyte Characters, Encoding errors, and Lengths
+
+Files:
+lib/mcel.c
+lib/mcel.h
+
+Depends-on:
+assert-h
+extern-inline
+limits-h
+mbrtoc32
+stdbool
+uchar
+verify
+
+configure.ac:
+
+Makefile.am:
+lib_SOURCES += mcel.c mcel.h
+
+Include:
+"mcel.h"
+
+Link:
+$(LTLIBUNISTRING) when linking with libtool, $(LIBUNISTRING) otherwise
+$(MBRTOWC_LIB)
+$(LTLIBC32CONV) when linking with libtool, $(LIBC32CONV) otherwise
+
+License:
+LGPLv2+
+
+Maintainer:
+all
-- 
2.39.2

From 988b7b2f88972e26d7b828c3f0925d50c2fb354e Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Sep 2023 14:51:56 -0700
Subject: [PATCH 2/7] mcel-tests: new module

* modules/mcel-tests, tests/test-mcel.c: New files
---
 ChangeLog          |   3 +
 modules/mcel-tests |  12 ++++
 tests/test-mcel.c  | 137 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 152 insertions(+)
 create mode 100644 modules/mcel-tests
 create mode 100644 tests/test-mcel.c

diff --git a/ChangeLog b/ChangeLog
index d477347b91..1b10dda6a9 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,8 @@
 2023-09-07  Paul Eggert  <egg...@cs.ucla.edu>
 
+	mcel-tests: new module
+	* modules/mcel-tests, tests/test-mcel.c: New files
+
 	mcel: new module
 	* lib/mcel.c, lib/mcel.h, modules/mcel: New files.
 
diff --git a/modules/mcel-tests b/modules/mcel-tests
new file mode 100644
index 0000000000..4b9ba0eeaf
--- /dev/null
+++ b/modules/mcel-tests
@@ -0,0 +1,12 @@
+Files:
+tests/test-mcel.c
+
+Depends-on:
+assert-h
+setlocale
+
+configure.ac:
+
+Makefile.am:
+TESTS += test-mcel
+check_PROGRAMS += test-mcel
diff --git a/tests/test-mcel.c b/tests/test-mcel.c
new file mode 100644
index 0000000000..2977ec06a0
--- /dev/null
+++ b/tests/test-mcel.c
@@ -0,0 +1,137 @@
+/* Test <mcel.h>
+   Copyright 2023 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+#include <config.h>
+
+#include <mcel.h>
+
+#include <locale.h>
+
+#include "macros.h"
+
+static wint_t
+to_ascii (wint_t c)
+{
+  return c & 0x7f;
+}
+
+static int
+sgn (int i)
+{
+  return (i > 0) - (i < 0);
+}
+
+static void
+test_mcel_vs_mbrtoc32 (unsigned char uc, mcel_t c, size_t n, char32_t ch)
+{
+  ASSERT (!c.err == (n <= MB_LEN_MAX));
+  ASSERT (c.err
+          ? c.err == uc && c.ch == 0 && c.len == 1
+          : c.ch == ch && c.len == (n ? n : 1));
+}
+
+int
+main (void)
+{
+  /* configure should already have checked that the locale is supported.  */
+  if (setlocale (LC_ALL, "") == NULL)
+    return 1;
+
+  mcel_t prev;
+  for (int ch = 0; ch < 0x80; ch++)
+    {
+      mcel_t c = mcel_ch (ch, 1);
+      ASSERT (c.ch == ch);
+      ASSERT (c.len == 1);
+      ASSERT (!c.err);
+      ASSERT (mcel_cmp (c, c) == 0);
+      ASSERT (mcel_tocmp (to_ascii, c, c) == 0);
+      if (ch)
+        {
+          ASSERT (mcel_cmp (prev, c) < 0);
+          ASSERT (mcel_cmp (c, prev) > 0);
+          ASSERT (mcel_tocmp (to_ascii, prev, c) < 0);
+          ASSERT (mcel_tocmp (to_ascii, c, prev) > 0);
+        }
+      ASSERT (mcel_isbasic (ch));
+      prev = c;
+    }
+  for (char ch = CHAR_MIN; ; ch++)
+    {
+      ASSERT (mcel_isbasic (ch) == (0 <= ch && ch <= 0x7f));
+      if (ch == CHAR_MAX)
+        break;
+    }
+  for (int ch = 0x80; ch < 0x200; ch++)
+    {
+      mcel_t c = mcel_ch (ch, 2);
+      ASSERT (c.ch == ch);
+      ASSERT (c.len == 2);
+      ASSERT (!c.err);
+      ASSERT (mcel_cmp (c, c) == 0);
+      ASSERT (mcel_tocmp (to_ascii, c, c) == 0);
+      ASSERT (mcel_cmp (prev, c) < 0);
+      ASSERT (mcel_cmp (c, prev) > 0);
+      ASSERT (mcel_tocmp (to_ascii, c, c) == 0);
+      int cmp = to_ascii (c.ch) ? -1 : 1;
+      ASSERT (sgn (mcel_tocmp (to_ascii, prev, c)) == cmp);
+      ASSERT (sgn (mcel_tocmp (to_ascii, c, prev)) == -cmp);
+      prev = c;
+    }
+  for (unsigned char err = 0x80; ; err++)
+    {
+      mcel_t c = mcel_err (err);
+      ASSERT (!c.ch);
+      ASSERT (c.len == 1);
+      ASSERT (c.err == err);
+      ASSERT (mcel_cmp (c, c) == 0);
+      ASSERT (mcel_cmp (prev, c) < 0);
+      ASSERT (mcel_cmp (c, prev) > 0);
+      ASSERT (mcel_tocmp (to_ascii, c, c) == 0);
+      ASSERT (mcel_tocmp (to_ascii, prev, c) < 0);
+      ASSERT (mcel_tocmp (to_ascii, c, prev) > 0);
+      prev = c;
+      if (err == (unsigned char) -1)
+        break;
+    }
+
+  for (int i = CHAR_MIN; i <= CHAR_MAX; i++)
+    for (int j = CHAR_MIN; j <= CHAR_MAX; j++)
+      for (int k = CHAR_MIN; k <= CHAR_MAX; k++)
+        {
+          char const ijk[] = {i, j, k};
+          mbstate_t mbs = {0};
+          char32_t ch;
+          size_t n = mbrtoc32 (&ch, ijk, sizeof ijk, &mbs);
+          mcel_t c = mcel_scan (ijk, ijk + sizeof ijk);
+          test_mcel_vs_mbrtoc32 (i, c, n, ch);
+
+          static char const terminator[] = "\r\n./";
+          for (int ti = 0; ti < sizeof terminator; ti++)
+            {
+              char t = terminator[ti];
+              if (i == t)
+                continue;
+              mcel_t d = mcel_scant (ijk, t);
+              ASSERT (c.ch == d.ch && c.err == d.err && c.len == d.len);
+              if (!t)
+                {
+                  mcel_t z = mcel_scanz (ijk);
+                  ASSERT (d.ch == z.ch && d.err == z.err && d.len == z.len);
+                }
+            }
+        }
+}
-- 
2.39.2

From d9ad9a68fd418286bcaf0b4c71c3ae2fc63a09c5 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Sep 2023 14:51:56 -0700
Subject: [PATCH 3/7] mcel-bench-tests: new module

* modules/mcel-bench-tests, tests/bench-mcel.c: New files.
* tests/bench-multibyte.h (TEXT_LATIN_ASCII_LINE1)
(TEXT_FRENCH_UTF8_LINE1, TEXT_GREEK_UTF8_LINE1)
(TEXT_CHINESE_UTF8_LINE1): New macros.
(text_random_bytes): New constant.
* tests/bench.h (timing_output): Mark with _GL_UNUSED,
since bench-mcel.c does not use it.
---
 ChangeLog                |   9 +
 modules/mcel-bench-tests |  23 +++
 tests/bench-mcel.c       | 369 +++++++++++++++++++++++++++++++++++++++
 tests/bench-multibyte.h  | 139 +++++++++++++++
 tests/bench.h            |   2 +-
 5 files changed, 541 insertions(+), 1 deletion(-)
 create mode 100644 modules/mcel-bench-tests
 create mode 100644 tests/bench-mcel.c

diff --git a/ChangeLog b/ChangeLog
index 1b10dda6a9..cbb1979acb 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,14 @@
 2023-09-07  Paul Eggert  <egg...@cs.ucla.edu>
 
+	mcel-bench-tests: new module
+	* modules/mcel-bench-tests, tests/bench-mcel.c: New files.
+	* tests/bench-multibyte.h (TEXT_LATIN_ASCII_LINE1)
+	(TEXT_FRENCH_UTF8_LINE1, TEXT_GREEK_UTF8_LINE1)
+	(TEXT_CHINESE_UTF8_LINE1): New macros.
+	(text_random_bytes): New constant.
+	* tests/bench.h (timing_output): Mark with _GL_UNUSED,
+	since bench-mcel.c does not use it.
+
 	mcel-tests: new module
 	* modules/mcel-tests, tests/test-mcel.c: New files
 
diff --git a/modules/mcel-bench-tests b/modules/mcel-bench-tests
new file mode 100644
index 0000000000..ea64a2f60c
--- /dev/null
+++ b/modules/mcel-bench-tests
@@ -0,0 +1,23 @@
+Files:
+tests/bench-mcel.c
+tests/bench-multibyte.h
+tests/bench.h
+
+Depends-on:
+mbiter
+mbiterf
+mbrtoc32-regular
+mbuiter
+mbuiterf
+mcel
+setlocale
+striconv
+getrusage
+gettimeofday
+
+configure.ac:
+
+Makefile.am:
+noinst_PROGRAMS += bench-mcel
+bench_mcel_CPPFLAGS = $(AM_CPPFLAGS) -DNDEBUG
+bench_mcel_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV)
diff --git a/tests/bench-mcel.c b/tests/bench-mcel.c
new file mode 100644
index 0000000000..3fbfe122c9
--- /dev/null
+++ b/tests/bench-mcel.c
@@ -0,0 +1,369 @@
+/* Benchmark mcel and some alternatives
+   Copyright 2023 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+#include <config.h>
+
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <locale.h>
+#include <uchar.h>
+
+#include "bench.h"
+#include "bench-multibyte.h"
+#include "mbiter.h"
+#include "mbiterf.h"
+#include "mbuiter.h"
+#include "mbuiterf.h"
+#include "mcel.h"
+
+typedef unsigned long long (*test_function) (char const *, char const *, int);
+
+static unsigned long long
+noop_test (char const *text, char const *text_end, int repeat)
+{
+  unsigned long long sum = 0;
+
+  for (int count = 0; count < repeat; count++)
+    {
+      char const *iter;
+      for (iter = text; iter < text_end; iter++)
+        sum += (uintptr_t) iter;
+    }
+
+  return sum;
+}
+
+static unsigned long long
+single_test (char const *text, char const *text_end, int repeat)
+{
+  unsigned long long sum = 0;
+
+  for (int count = 0; count < repeat; count++)
+    for (char const *iter = text; iter < text_end; )
+      {
+        unsigned char c = *iter++;
+        sum += c;
+      }
+
+  return sum;
+}
+
+static unsigned long long
+mbiter_test (char const *text, char const *text_end, int repeat)
+{
+  unsigned long long sum = 0;
+
+  size_t text_len = text_end - text;
+  for (int count = 0; count < repeat; count++)
+    {
+      mbi_iterator_t iter;
+      for (mbi_init (iter, text, text_len); mbi_avail (iter); )
+        {
+          mbchar_t cur = mbi_cur (iter);
+          mbi_advance (iter);
+          sum += cur.wc_valid ? cur.wc : (unsigned char) *mb_ptr (cur) << 16;
+        }
+    }
+
+  return sum;
+}
+
+static unsigned long long
+mbiterf_test (char const *text, char const *text_end, int repeat)
+{
+  unsigned long long sum = 0;
+
+  for (int count = 0; count < repeat; count++)
+    {
+      mbif_state_t state;
+      char const *iter;
+      for (mbif_init (state), iter = text; mbif_avail (state, iter, text_end); )
+        {
+          mbchar_t cur = mbif_next (state, iter, text_end);
+          iter += mb_len (cur);
+          sum += cur.wc_valid ? cur.wc : (unsigned char) *mb_ptr (cur) << 16;
+        }
+    }
+
+  return sum;
+}
+
+static unsigned long long
+mbuiter_test (char const *text, char const *text_end, int repeat)
+{
+  unsigned long long sum = 0;
+
+  for (int count = 0; count < repeat; count++)
+    for (char const *t = text; t < text_end; t++)
+      {
+        mbui_iterator_t iter;
+        for (mbui_init (iter, t); mbui_avail (iter); )
+          {
+            mbchar_t cur = mbui_cur (iter);
+            mbui_advance (iter);
+            sum += cur.wc_valid ? cur.wc : (unsigned char) *mb_ptr (cur) << 16;
+          }
+        t = mbui_cur_ptr (iter);
+      }
+
+  return sum;
+}
+
+static unsigned long long
+mbuiterf_test (char const *text, char const *text_end, int repeat)
+{
+  unsigned long long sum = 0;
+
+  for (int count = 0; count < repeat; count++)
+    for (char const *t = text; t < text_end; t++)
+      {
+        mbuif_state_t state;
+        char const *iter;
+        for (mbuif_init (state), iter = t; mbuif_avail (state, iter); )
+          {
+            mbchar_t cur = mbuif_next (state, iter);
+            iter += mb_len (cur);
+            sum += cur.wc_valid ? cur.wc : (unsigned char) *mb_ptr (cur) << 16;
+          }
+        t = iter;
+      }
+
+  return sum;
+}
+
+static unsigned long long
+mcel_test (char const *text, char const *text_end, int repeat)
+{
+  unsigned long long sum = 0;
+
+  for (int count = 0; count < repeat; count++)
+    for (char const *iter = text; iter < text_end; )
+      {
+        mcel_t g = mcel_scan (iter, text_end);
+        iter += g.len;
+        sum += g.ch | (g.err << 16);
+      }
+
+  return sum;
+}
+
+static unsigned long long
+mcuel_test (char const *text, char const *text_end, int repeat)
+{
+  unsigned long long sum = 0;
+
+  for (int count = 0; count < repeat; count++)
+    for (char const *t = text; t < text_end; t++)
+      {
+        char const *iter = t;
+        while (*iter)
+          {
+            mcel_t g = mcel_scanz (iter);
+            iter += g.len;
+            sum += g.ch | (g.err << 16);
+          }
+        t = iter;
+      }
+
+  return sum;
+}
+
+static unsigned long long
+do_1_test (test_function test, char const *text,
+           char const *text_end, int repeat, struct timings_state *ts)
+{
+  timing_start (ts);
+  unsigned long long sum = test (text, text_end, repeat);
+  timing_end (ts);
+  return sum;
+}
+
+static void
+do_test (char test, int repeat, char const *locale_name,
+         char const *text, size_t text_len)
+{
+  if (setlocale (LC_ALL, locale_name) != NULL)
+    {
+      char const *text_end = text + text_len;
+
+      static struct
+      {
+        char const *name;
+        test_function fn;
+        struct timings_state ts;
+        unsigned long long volatile sum;
+      } testdesc[] = {
+        { "noop", noop_test },
+        { "single", single_test },
+        { "mbiter", mbiter_test },
+        { "mbiterf", mbiterf_test },
+        { "mbuiter", mbuiter_test },
+        { "mbuiterf", mbuiterf_test },
+        { "mcel", mcel_test },
+        { "mcuel", mcuel_test },
+      };
+      int ntestdesc = sizeof testdesc / sizeof *testdesc;
+      for (int i = 0; i < ntestdesc; i++)
+        testdesc[i].sum =
+          do_1_test (testdesc[i].fn, text, text_end, repeat, &testdesc[i].ts);
+
+      setlocale (LC_ALL, "C");
+
+      static bool header_printed;
+      if (!header_printed)
+        {
+          printf (" ");
+          for (int i = 0; i < ntestdesc; i++)
+            printf (" %8s", testdesc[i].name);
+          printf ("\n");
+          header_printed = true;
+        }
+
+      printf ("%c", test);
+      for (int i = 0; i < ntestdesc; i++)
+        {
+          double user_usec = testdesc[i].ts.user_usec;
+          double sys_usec = testdesc[i].ts.sys_usec;
+          printf (" %8.3f", (user_usec + sys_usec) / 1e6);
+        }
+      printf ("\n");
+    }
+  else
+    {
+      printf ("Skipping test: locale %s not installed.\n", locale_name);
+    }
+}
+
+/* Performs some or all of the following tests:
+     A - ASCII text, C locale
+     B - ASCII text, UTF-8 locale
+     C - French text, C locale
+     D - French text, ISO-8859-1 locale
+     E - French text, UTF-8 locale
+     F - Greek text, C locale
+     G - Greek text, ISO-8859-7 locale
+     H - Greek text, UTF-8 locale
+     I - Chinese text, UTF-8 locale
+     J - Chinese text, GB18030 locale
+     K - Random bytes, C locale
+     L - Random bytes, UTF-8 locale
+     a - short ASCII text, C locale
+     b - short ASCII text, UTF-8 locale
+     e - short French text, UTF-8 locale
+     h - short Greek text, UTF-8 locale
+     i - short Chinese text, UTF-8 locale
+   Pass the tests to be performed as first argument.  */
+int
+main (int argc, char *argv[])
+{
+  if (argc != 3)
+    {
+      fprintf (stderr, "Usage: %s TESTS REPETITIONS\n", argv[0]);
+
+      fprintf (stderr, "Example: %s ABCDEFGHIJKabehi 100000\n", argv[0]);
+      exit (1);
+    }
+
+  char const *tests = argv[1];
+  int repeat = atoi (argv[2]);
+
+  text_init ();
+
+  /* Execute each test.  */
+  size_t i;
+  for (i = 0; i < strlen (tests); i++)
+    {
+      char test = tests[i];
+
+      switch (test)
+        {
+        case 'A':
+          do_test (test, repeat, "C", text_latin_ascii,
+                   strlen (text_latin_ascii));
+          break;
+        case 'a':
+          do_test (test, repeat, "C", TEXT_LATIN_ASCII_LINE1,
+                   strlen (TEXT_LATIN_ASCII_LINE1));
+          break;
+        case 'B':
+          do_test (test, repeat, "en_US.UTF-8", text_latin_ascii,
+                   strlen (text_latin_ascii));
+          break;
+        case 'b':
+          do_test (test, repeat, "en_US.UTF-8", TEXT_LATIN_ASCII_LINE1,
+                   strlen (TEXT_LATIN_ASCII_LINE1));
+          break;
+        case 'C':
+          do_test (test, repeat, "C", text_french_iso8859,
+                   strlen (text_french_iso8859));
+          break;
+        case 'D':
+          do_test (test, repeat, "fr_FR.ISO-8859-1", text_french_iso8859,
+                   strlen (text_french_iso8859));
+          break;
+        case 'E':
+          do_test (test, repeat, "en_US.UTF-8", text_french_utf8,
+                   strlen (text_french_utf8));
+          break;
+        case 'e':
+          do_test (test, repeat, "en_US.UTF-8", TEXT_FRENCH_UTF8_LINE1,
+                   strlen (TEXT_FRENCH_UTF8_LINE1));
+          break;
+        case 'F':
+          do_test (test, repeat, "C", text_greek_iso8859,
+                   strlen (text_greek_iso8859));
+          break;
+        case 'G':
+          do_test (test, repeat, "el_GR.ISO-8859-7", text_greek_iso8859,
+                   strlen (text_greek_iso8859));
+          break;
+        case 'H':
+          do_test (test, repeat, "en_US.UTF-8", text_greek_utf8,
+                   strlen (text_greek_utf8));
+          break;
+        case 'h':
+          do_test (test, repeat, "en_US.UTF-8", TEXT_GREEK_UTF8_LINE1,
+                   strlen (TEXT_GREEK_UTF8_LINE1));
+          break;
+        case 'I':
+          do_test (test, repeat, "en_US.UTF-8", text_chinese_utf8,
+                   strlen (text_chinese_utf8));
+          break;
+        case 'i':
+          do_test (test, repeat, "en_US.UTF-8", TEXT_CHINESE_UTF8_LINE1,
+                   strlen (TEXT_CHINESE_UTF8_LINE1));
+          break;
+        case 'J':
+          do_test (test, repeat, "zh_CN.GB18030", text_chinese_gb18030,
+                   strlen (text_chinese_gb18030));
+          break;
+        case 'K':
+          do_test (test, repeat, "C", text_random_bytes,
+                   sizeof text_random_bytes - 1);
+          break;
+        case 'L':
+          do_test (test, repeat, "en_US.UTF-8", text_random_bytes,
+                   sizeof text_random_bytes - 1);
+          break;
+        default:
+          /* Ignore.  */
+          ;
+        }
+    }
+
+  return 0;
+}
diff --git a/tests/bench-multibyte.h b/tests/bench-multibyte.h
index d1aec951a0..6e475ada15 100644
--- a/tests/bench-multibyte.h
+++ b/tests/bench-multibyte.h
@@ -21,7 +21,9 @@
    Liber I, Sermo IX
  */
 static char const text_latin_ascii[] =
+#define TEXT_LATIN_ASCII_LINE1 \
   "ibam forte via sacra, sicut meus est mos,\n"
+  TEXT_LATIN_ASCII_LINE1
   "nescio quid meditans nugarum, totus in illis:\n"
   "accurrit quidam notus mihi nomine tantum\n"
   "arreptaque manu 'quid agis, dulcissime rerum?'\n"
@@ -102,7 +104,9 @@ static char const text_latin_ascii[] =
   ;
 
 static char const text_french_utf8[] =
+#define TEXT_FRENCH_UTF8_LINE1 \
   "J'errais par hasard sur une voie sacrée, comme c'est ma coutume,\n"
+  TEXT_FRENCH_UTF8_LINE1
   "Méditant je ne sais quoi de frivole, totalement absorbé par ces pensées :\n"
   "Arrive soudain quelqu'un de connu, seulement par son nom,\n"
   "Et me saisissant par la main, il dit : « Comment vas-tu, ô douceur des choses ? »\n"
@@ -185,7 +189,9 @@ static char const text_french_utf8[] =
 static char const *text_french_iso8859;
 
 static char const text_greek_utf8[] =
+#define TEXT_GREEK_UTF8_LINE1 \
   "περιπάτων μέντοι κατά την ιερή οδό, όπως είναι η συνήθειά μου,\n"
+  TEXT_GREEK_UTF8_LINE1
   "σκεφτόμενος άσχημα πράγματα, πλήρως αφοσιωμένος σε αυτά:\n"
   "έρχεται ένας γνωστός με όνομα μόνον,\n"
   "και αρπάζοντας το χέρι μου, λέει \"τι κάνεις, πιο γλυκέ των πραγμάτων;\"\n"
@@ -261,7 +267,9 @@ static char const text_greek_utf8[] =
 static char const *text_greek_iso8859;
 
 static char const text_chinese_utf8[] =
+#define TEXT_CHINESE_UTF8_LINE1 \
   "我偶然走在圣路上，正如我的习惯，\n"
+  TEXT_CHINESE_UTF8_LINE1
   "心里想着一些无聊的事情，全神贯注其中：\n"
   "突然有个熟人从我身边跑过，只知道我的名字，\n"
   "他一把抓住我的手说：“你好，最甜蜜的人！”\n"
@@ -337,6 +345,137 @@ static char const text_chinese_utf8[] =
 
 static char const *text_chinese_gb18030;
 
+/* 2000 random bytes (including NUL bytes) followed by NUL.  Generated by:
+   od -An -N2000 -to1 /dev/urandom | sed 's/  *''/\\/g; s/.*''/  "&"/'
+   in the C locale.  */
+static char const text_random_bytes[] =
+  "\002\025\262\356\251\052\313\037\234\000\160\247\162\250\011\140"
+  "\212\121\014\223\070\256\312\363\204\362\130\226\374\256\365\364"
+  "\173\131\373\270\066\034\021\216\072\021\050\250\106\146\167\327"
+  "\031\301\160\324\346\334\250\111\066\377\315\004\355\167\225\176"
+  "\257\070\334\005\354\337\320\037\272\172\156\042\312\077\134\217"
+  "\116\240\022\232\014\244\225\114\354\204\224\212\130\062\360\312"
+  "\076\323\154\332\127\230\050\377\263\165\346\371\244\070\140\120"
+  "\371\313\311\232\256\244\150\003\320\132\045\257\001\112\057\264"
+  "\111\334\370\033\022\246\347\224\032\112\130\166\263\140\204\310"
+  "\323\315\214\265\313\172\275\100\020\311\215\207\061\031\000\101"
+  "\132\044\050\020\372\003\011\347\135\120\026\367\376\213\336\061"
+  "\117\223\005\217\330\217\227\121\134\011\353\137\247\255\000\353"
+  "\376\147\004\152\261\306\106\341\364\355\067\047\261\167\076\066"
+  "\102\353\026\203\165\226\245\270\036\222\003\134\112\200\375\314"
+  "\204\023\351\021\240\123\211\165\103\210\100\030\377\205\162\307"
+  "\027\024\342\231\216\121\113\243\151\243\045\237\351\346\016\320"
+  "\374\127\314\272\226\371\072\030\134\021\311\202\252\060\263\305"
+  "\262\261\043\065\341\265\364\225\047\140\347\025\073\054\060\053"
+  "\345\202\031\234\246\201\164\313\251\076\022\214\121\331\376\160"
+  "\237\145\122\264\214\073\277\254\020\020\322\030\006\221\261\355"
+  "\366\023\162\326\137\147\137\132\005\223\312\123\103\330\127\341"
+  "\207\240\175\036\277\075\213\255\031\223\366\060\350\361\271\122"
+  "\274\145\174\030\333\230\077\323\104\031\062\374\077\345\276\154"
+  "\224\006\346\376\101\040\156\060\227\172\336\156\305\050\225\236"
+  "\207\233\253\232\062\021\003\110\035\266\315\342\114\162\126\050"
+  "\146\216\165\345\125\061\137\350\307\236\205\350\026\221\267\305"
+  "\051\115\130\050\103\141\077\251\131\326\262\232\164\060\056\165"
+  "\152\027\145\144\323\030\065\247\321\317\153\316\363\232\271\222"
+  "\372\012\223\256\064\354\243\305\002\333\075\143\366\214\270\016"
+  "\041\320\336\070\250\070\354\354\373\157\365\204\122\215\131\246"
+  "\176\147\122\221\101\331\366\001\325\354\271\227\010\152\050\060"
+  "\011\254\317\037\107\024\374\127\042\250\012\123\355\216\207\012"
+  "\210\007\252\043\244\023\125\142\246\250\325\275\136\247\260\177"
+  "\363\063\063\315\263\134\134\347\373\005\001\373\354\274\302\177"
+  "\253\343\324\031\050\126\371\251\146\224\276\374\100\054\165\011"
+  "\040\032\243\014\320\030\237\111\065\353\043\057\141\343\256\265"
+  "\134\221\214\250\242\171\056\277\146\370\031\057\334\352\235\154"
+  "\240\233\027\106\206\317\237\236\356\325\241\272\064\137\227\263"
+  "\371\043\003\327\117\320\026\313\323\244\077\174\067\273\136\213"
+  "\370\057\170\024\266\046\075\045\234\257\311\230\216\303\367\357"
+  "\217\021\312\241\210\323\341\220\331\017\354\113\054\171\377\007"
+  "\341\171\157\145\371\025\005\112\137\241\271\352\156\161\107\231"
+  "\006\365\331\020\023\366\337\336\341\352\014\213\045\337\206\032"
+  "\116\230\206\000\353\074\311\240\102\004\124\251\212\261\336\322"
+  "\251\344\347\040\205\025\267\324\315\142\164\366\330\047\066\122"
+  "\205\270\200\316\142\252\351\246\350\122\217\336\222\266\350\124"
+  "\350\370\170\360\256\066\206\043\175\335\054\037\112\131\166\266"
+  "\245\054\221\370\370\344\310\332\006\253\317\071\161\310\243\035"
+  "\367\212\233\274\043\331\140\212\353\017\022\277\162\020\027\356"
+  "\130\040\140\350\016\205\311\156\102\144\250\100\123\334\374\300"
+  "\171\353\317\273\126\204\016\200\346\155\172\016\047\357\135\277"
+  "\045\216\276\214\017\202\231\000\377\176\005\043\301\277\274\052"
+  "\223\101\127\212\260\123\011\051\067\110\330\322\061\272\225\127"
+  "\061\011\031\305\043\243\352\376\376\257\035\050\304\267\174\177"
+  "\021\171\220\356\004\166\036\307\044\005\305\266\136\042\156\043"
+  "\240\226\115\021\202\020\354\011\042\355\156\237\323\006\164\317"
+  "\054\212\330\361\373\114\324\325\136\041\367\024\025\247\330\207"
+  "\136\075\004\067\220\036\034\231\166\135\066\366\041\061\055\256"
+  "\370\340\323\026\234\333\356\076\174\267\124\104\265\050\035\061"
+  "\102\052\157\034\167\217\362\031\064\035\313\276\334\317\223\363"
+  "\166\306\004\341\010\204\216\066\325\073\170\000\024\263\366\116"
+  "\336\346\023\244\242\334\002\376\257\266\312\004\275\047\272\161"
+  "\317\373\307\146\252\150\007\072\363\014\166\204\370\303\211\044"
+  "\352\152\076\226\017\252\223\054\126\111\121\315\301\305\111\117"
+  "\375\204\120\337\152\125\153\320\012\224\344\051\351\227\164\226"
+  "\226\243\167\254\133\133\377\125\101\223\375\147\231\074\307\312"
+  "\025\324\156\321\144\154\311\262\241\351\374\341\204\050\343\132"
+  "\271\015\176\147\117\354\312\244\360\276\027\142\261\351\321\337"
+  "\347\065\353\242\145\253\242\270\347\175\330\070\113\030\307\372"
+  "\032\215\304\335\364\062\051\265\040\260\175\312\301\356\164\140"
+  "\003\071\305\371\374\270\135\370\324\001\114\011\035\131\036\300"
+  "\162\117\245\007\311\240\317\230\166\114\140\054\175\256\026\176"
+  "\107\341\217\105\032\142\117\036\127\230\355\223\117\272\003\104"
+  "\143\041\140\146\046\004\305\376\047\174\334\033\271\156\026\057"
+  "\306\363\160\323\156\151\347\347\232\052\111\321\315\337\174\023"
+  "\363\312\044\102\310\116\217\226\052\367\344\342\144\235\146\303"
+  "\044\351\375\212\237\052\071\335\173\042\057\105\141\101\164\175"
+  "\116\366\040\134\075\125\312\233\216\262\345\362\270\274\111\017"
+  "\134\272\126\234\053\066\300\026\151\274\317\213\022\202\372\245"
+  "\134\056\000\300\152\265\346\207\272\061\135\055\257\134\100\260"
+  "\015\056\015\004\365\365\040\031\047\225\326\135\252\031\071\111"
+  "\203\032\174\102\072\013\012\135\152\125\060\201\111\074\112\235"
+  "\053\266\077\360\302\226\161\247\005\216\050\122\307\064\112\257"
+  "\131\017\354\101\031\226\200\040\353\015\303\073\007\025\376\163"
+  "\174\050\125\135\366\064\051\044\306\003\271\057\023\300\265\254"
+  "\202\110\171\241\150\207\153\266\304\336\265\052\151\123\345\215"
+  "\061\314\310\344\364\161\266\052\010\372\067\007\010\217\121\067"
+  "\325\145\350\037\066\333\165\142\233\137\016\313\202\162\354\341"
+  "\241\126\337\361\107\062\106\140\222\027\022\260\305\156\206\200"
+  "\250\251\347\040\115\157\314\114\336\030\256\134\142\270\340\242"
+  "\137\215\317\075\112\073\134\007\353\325\310\036\367\050\102\263"
+  "\302\113\311\203\254\152\157\216\314\016\140\143\172\154\150\217"
+  "\054\024\376\121\336\204\144\210\311\073\204\160\316\363\113\021"
+  "\020\142\251\251\053\372\336\220\107\152\011\064\120\063\017\154"
+  "\132\266\012\376\171\174\176\230\053\272\104\055\201\103\234\000"
+  "\171\044\000\102\211\365\157\330\321\357\065\025\146\127\212\035"
+  "\217\224\101\307\016\221\131\231\266\264\166\245\142\032\141\051"
+  "\156\110\253\076\054\322\316\007\274\270\240\131\151\131\037\367"
+  "\317\162\313\021\367\224\214\301\070\346\121\242\016\215\337\317"
+  "\015\030\221\375\156\335\113\173\037\126\056\045\222\250\236\117"
+  "\116\230\366\245\210\007\144\001\326\344\350\051\226\215\266\350"
+  "\247\231\021\121\057\167\341\307\211\336\362\244\006\124\321\120"
+  "\375\211\074\271\174\166\304\347\164\061\072\254\104\332\373\266"
+  "\042\140\031\277\366\171\245\076\003\113\105\044\327\356\053\122"
+  "\017\056\317\117\244\133\217\012\357\004\327\236\362\170\266\112"
+  "\162\171\276\042\226\145\365\013\360\167\067\363\234\212\211\146"
+  "\135\334\126\005\351\317\173\060\311\074\126\162\072\217\141\227"
+  "\311\013\074\321\176\371\103\206\266\340\004\046\070\150\356\173"
+  "\351\024\322\073\007\207\345\372\344\242\074\006\266\337\214\117"
+  "\172\253\340\063\120\052\236\260\150\252\161\334\047\145\154\204"
+  "\007\334\374\346\322\241\246\126\123\137\016\315\134\370\126\064"
+  "\230\014\107\063\172\371\246\262\211\016\364\215\064\340\046\040"
+  "\057\023\016\072\377\013\303\161\273\264\240\152\223\312\361\117"
+  "\321\253\232\217\142\054\367\156\210\066\215\203\031\346\377\371"
+  "\253\067\001\044\245\177\077\236\170\255\226\331\301\360\216\161"
+  "\343\325\267\015\025\165\322\134\113\304\030\047\372\064\176\367"
+  "\212\272\325\372\103\126\207\002\022\320\213\174\230\311\012\115"
+  "\006\037\204\336\337\116\163\345\143\005\365\012\147\343\213\040"
+  "\247\337\207\225\341\346\373\176\214\232\250\275\075\276\054\017"
+  "\270\112\220\121\205\061\111\125\256\140\212\163\033\317\234\322"
+  "\076\321\243\054\301\273\252\063\106\257\005\176\360\334\370\047"
+  "\055\134\056\243\356\004\105\331\014\150\304\366\350\143\142\020"
+  "\170\171\324\217\142\041\121\165\375\173\377\255\256\167\000\213"
+  "\144\006\122\032\322\355\166\020\361\065\073\336\306\145\004\271"
+  "\277\005\016\001\116\371\000\376\033\253\117\114\044\215\314\257"
+  ;
+
 static void
 text_init ()
 {
diff --git a/tests/bench.h b/tests/bench.h
index 4bfdbd4ec1..caac5e5ab6 100644
--- a/tests/bench.h
+++ b/tests/bench.h
@@ -60,7 +60,7 @@ timing_end (struct timings_state *ts)
                  + usage.ru_stime.tv_usec - ts->sys_start.tv_usec;
 }
 
-static void
+_GL_UNUSED static void
 timing_output (const struct timings_state *ts)
 {
   printf ("real %10.6f\n", (double)ts->real_usec / 1000000.0);
-- 
2.39.2

From 7b430a277a2443d968dac2735630ded561028987 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Sep 2023 14:51:57 -0700
Subject: [PATCH 4/7] mcel-prefer: new module

* modules/mcel-prefer: New file.
---
 ChangeLog           |  3 +++
 modules/mcel-prefer | 28 ++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)
 create mode 100644 modules/mcel-prefer

diff --git a/ChangeLog b/ChangeLog
index cbb1979acb..5c967214ed 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,8 @@
 2023-09-07  Paul Eggert  <egg...@cs.ucla.edu>
 
+	mcel-prefer: new module
+	* modules/mcel-prefer: New file.
+
 	mcel-bench-tests: new module
 	* modules/mcel-bench-tests, tests/bench-mcel.c: New files.
 	* tests/bench-multibyte.h (TEXT_LATIN_ASCII_LINE1)
diff --git a/modules/mcel-prefer b/modules/mcel-prefer
new file mode 100644
index 0000000000..5c5ac24054
--- /dev/null
+++ b/modules/mcel-prefer
@@ -0,0 +1,28 @@
+Description:
+Prefer mcel to the mbiter family.  mcel is simpler and can be faster.
+However, it does not support some obsolete encodings that are also not
+supported by glibc locales, and the caller is responsible for
+coalescing sequences of error-encoding bytes if that is desired.
+
+Files:
+
+Depends-on:
+mcel
+
+configure.ac-early:
+# Prefer mcel by default.  This can be overridden via
+# './configure GNULIB_MCEL_PREFER=no'.
+: ${GNULIB_MCEL_PREFER=yes}
+
+configure.ac:
+gl_MODULE_INDICATOR([mcel-prefer])
+
+Makefile.am:
+
+Include:
+
+License:
+LGPLv2+
+
+Maintainer:
+Paul Eggert
-- 
2.39.2

From 7219d38b5716cd25af2eb177c03948b9908e09c6 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Sep 2023 14:51:58 -0700
Subject: [PATCH 5/7] exclude: support GNULIB_MCEL_PREFER

Support mcel API for apps that prefer it.
The following changes are in effect only if GNULIB_MCEL_PREFER.
* lib/exclude.c: Include mcel.h instead of mbuiter.h.
(string_hasher_ci): Use mcel_scanz instead of mbui_init,
mbui_avail, mbui_cur, and mbui_advance.
* modules/exclude: Do not depend on mbuiter.
---
 ChangeLog       |  8 ++++++++
 lib/exclude.c   | 16 +++++++++++++++-
 modules/exclude |  2 +-
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 5c967214ed..ba49a1177b 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,13 @@
 2023-09-07  Paul Eggert  <egg...@cs.ucla.edu>
 
+	exclude: support GNULIB_MCEL_PREFER
+	Support mcel API for apps that prefer it.
+	The following changes are in effect only if GNULIB_MCEL_PREFER.
+	* lib/exclude.c: Include mcel.h instead of mbuiter.h.
+	(string_hasher_ci): Use mcel_scanz instead of mbui_init,
+	mbui_avail, mbui_cur, and mbui_advance.
+	* modules/exclude: Do not depend on mbuiter.
+
 	mcel-prefer: new module
 	* modules/mcel-prefer: New file.
 
diff --git a/lib/exclude.c b/lib/exclude.c
index d1ecaedfc6..a3479db8a6 100644
--- a/lib/exclude.c
+++ b/lib/exclude.c
@@ -36,7 +36,11 @@
 #include "filename.h"
 #include "fnmatch.h"
 #include "hash.h"
-#include "mbuiter.h"
+#if GNULIB_MCEL_PREFER
+# include "mcel.h"
+#else
+# include "mbuiter.h"
+#endif
 #include "xalloc.h"
 
 #if GNULIB_EXCLUDE_SINGLE_THREAD
@@ -204,7 +208,16 @@ string_hasher_ci (void const *data, size_t n_buckets)
   char const *p = data;
   size_t value = 0;
 
+#if GNULIB_MCEL_PREFER
+  while (*p)
+    {
+      mcel_t g = mcel_scanz (p);
+      value = value * 31 + (c32tolower (g.ch) - g.err);
+      p += g.len;
+    }
+#else
   mbui_iterator_t iter;
+
   for (mbui_init (iter, p); mbui_avail (iter); mbui_advance (iter))
     {
       mbchar_t m = mbui_cur (iter);
@@ -217,6 +230,7 @@ string_hasher_ci (void const *data, size_t n_buckets)
 
       value = value * 31 + wc;
     }
+#endif
 
   return value % n_buckets;
 }
diff --git a/modules/exclude b/modules/exclude
index 8adae5400f..92f8d3c472 100644
--- a/modules/exclude
+++ b/modules/exclude
@@ -13,7 +13,7 @@ fnmatch
 fopen-gnu
 hash
 mbscasecmp
-mbuiter
+mbuiter               [test "$GNULIB_MCEL_PREFER" != yes]
 nullptr
 regex
 stdbool
-- 
2.39.2
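
For readers without the mcel sources at hand, the idea behind the GNULIB_MCEL_PREFER branch of string_hasher_ci can be approximated in plain ISO C. The following is only a sketch under stated assumptions: it does not use mcel.h, the name hash_ci_sketch is hypothetical, it relies on mbrtowc and towlower instead of mcel_scanz and c32tolower, and it simply hashes the raw value of an invalid byte, which is not necessarily how mcel accounts for errors.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, not part of Gnulib: hash the NUL-terminated
   string S case-insensitively, one character at a time.  Each byte
   that does not start a valid character is treated as a one-byte
   encoding error.  Assumes a stateless encoding such as UTF-8.  */
static size_t
hash_ci_sketch (char const *s, size_t n_buckets)
{
  size_t value = 0;

  while (*s)
    {
      /* Look at no more than MB_LEN_MAX bytes, and never past the NUL.  */
      size_t avail = 0;
      while (avail < MB_LEN_MAX && s[avail])
        avail++;

      mbstate_t state;
      memset (&state, 0, sizeof state);
      wchar_t wc;
      size_t len = mbrtowc (&wc, s, avail, &state);

      if (len == (size_t) -1 || len == (size_t) -2)
        {
          /* Invalid or incomplete sequence: consume exactly one byte.  */
          value = value * 31 + (unsigned char) *s;
          s++;
        }
      else
        {
          value = value * 31 + towlower (wc);
          s += len;
        }
    }

  return value % n_buckets;
}
```

With this sketch, hash_ci_sketch ("Hello", 1009) and hash_ci_sketch ("hELLO", 1009) land in the same bucket, because both strings fold to the same lowercase characters before hashing.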

From 6887c63276334e1ca7875c431eef503553527f17 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Sep 2023 14:51:58 -0700
Subject: [PATCH 6/7] mbscasecmp: support GNULIB_MCEL_PREFER

* lib/mbscasecmp.c: Include stdlib.h, since we use MB_CUR_MAX.
Include uchar.h, for c32tolower.
(GNULIB_MCEL_PREFER): Include mcel.h instead of mbuiterf.h.
(mbscasecmp) [GNULIB_MCEL_PREFER]: Use mcel instead of mbuiterf.
* modules/mbscasecmp (Depends-on): Add c32tolower, stdlib, uchar.
Depend on mbuiterf only if not preferring mcel.
---
 ChangeLog          |  8 ++++++++
 lib/mbscasecmp.c   | 19 ++++++++++++++++++-
 modules/mbscasecmp |  5 ++++-
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index ba49a1177b..2aec64130e 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,13 @@
 2023-09-07  Paul Eggert  <egg...@cs.ucla.edu>
 
+	mbscasecmp: support GNULIB_MCEL_PREFER
+	* lib/mbscasecmp.c: Include stdlib.h, since we use MB_CUR_MAX.
+	Include uchar.h, for c32tolower.
+	(GNULIB_MCEL_PREFER): Include mcel.h instead of mbuiterf.h.
+	(mbscasecmp) [GNULIB_MCEL_PREFER]: Use mcel instead of mbuiterf.
+	* modules/mbscasecmp (Depends-on): Add c32tolower, stdlib, uchar.
+	Depend on mbuiterf only if not preferring mcel.
+
 	exclude: support GNULIB_MCEL_PREFER
 	Support mcel API for apps that prefer it.
 	The following changes are in effect only if GNULIB_MCEL_PREFER.
diff --git a/lib/mbscasecmp.c b/lib/mbscasecmp.c
index 80dc18529d..5e0bc67dc0 100644
--- a/lib/mbscasecmp.c
+++ b/lib/mbscasecmp.c
@@ -23,8 +23,14 @@
 
 #include <ctype.h>
 #include <limits.h>
+#include <stdlib.h>
+#include <uchar.h>
 
-#include "mbuiterf.h"
+#if GNULIB_MCEL_PREFER
+# include "mcel.h"
+#else
+# include "mbuiterf.h"
+#endif
 
 /* Compare the character strings S1 and S2, ignoring case, returning less than,
    equal to or greater than zero if S1 is lexicographically less than, equal to
@@ -45,6 +51,16 @@ mbscasecmp (const char *s1, const char *s2)
      most often already in the very few first characters.  */
   if (MB_CUR_MAX > 1)
     {
+#if GNULIB_MCEL_PREFER
+      while (true)
+        {
+          mcel_t g1 = mcel_scanz (iter1); iter1 += g1.len;
+          mcel_t g2 = mcel_scanz (iter2); iter2 += g2.len;
+          int cmp = mcel_tocmp (c32tolower, g1, g2);
+          if (cmp | !g1.ch)
+            return cmp;
+        }
+#else
       mbuif_state_t state1;
       mbuif_init (state1);
 
@@ -70,6 +86,7 @@ mbscasecmp (const char *s1, const char *s2)
         /* s1 terminated before s2.  */
         return -1;
       return 0;
+#endif
     }
   else
     for (;;)
diff --git a/modules/mbscasecmp b/modules/mbscasecmp
index 234b7bc7a3..2fd8f1f4ce 100644
--- a/modules/mbscasecmp
+++ b/modules/mbscasecmp
@@ -5,8 +5,11 @@ Files:
 lib/mbscasecmp.c
 
 Depends-on:
-mbuiterf
+c32tolower
+mbuiterf            [test "$GNULIB_MCEL_PREFER" != yes]
+stdlib
 string
+uchar
 
 configure.ac:
 gl_STRING_MODULE_INDICATOR([mbscasecmp])
-- 
2.39.2
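
The shape of the new mbscasecmp loop (scan one unit from each string, compare the case-folded units, stop at the first difference or at the terminating NUL) can likewise be sketched without mcel.h. Everything named here is hypothetical: unit_sketch, next_unit_sketch and casecmp_sketch are illustrative stand-ins, mbrtowc and towlower replace mcel_scanz and c32tolower, and sorting encoding-error bytes after all valid characters is an assumption of this sketch rather than a statement about mcel_tocmp.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical type for this sketch: one decoded unit of a string.  */
struct unit_sketch
{
  wint_t ch;    /* lowercased character, or the raw byte if IS_ERR */
  int is_err;   /* nonzero if the unit is a single invalid byte */
  size_t len;   /* bytes consumed; 0 only at the terminating NUL */
};

/* Decode the unit at the start of S.  Assumes a stateless encoding.  */
static struct unit_sketch
next_unit_sketch (char const *s)
{
  struct unit_sketch u = { 0, 0, 0 };
  if (!*s)
    return u;

  /* Look at no more than MB_LEN_MAX bytes, and never past the NUL.  */
  size_t avail = 0;
  while (avail < MB_LEN_MAX && s[avail])
    avail++;

  mbstate_t state;
  memset (&state, 0, sizeof state);
  wchar_t wc;
  size_t len = mbrtowc (&wc, s, avail, &state);

  if (len == (size_t) -1 || len == (size_t) -2)
    {
      /* Each encoding error is exactly one byte.  */
      u.ch = (unsigned char) *s;
      u.is_err = 1;
      u.len = 1;
    }
  else
    {
      u.ch = towlower (wc);
      u.len = len;
    }
  return u;
}

/* Compare S1 and S2 ignoring case; return <0, 0 or >0.  */
static int
casecmp_sketch (char const *s1, char const *s2)
{
  for (;;)
    {
      struct unit_sketch u1 = next_unit_sketch (s1);
      struct unit_sketch u2 = next_unit_sketch (s2);

      if (u1.is_err != u2.is_err)
        return u1.is_err ? 1 : -1;   /* assumption: errors sort last */
      if (u1.ch != u2.ch)
        return u1.ch < u2.ch ? -1 : 1;
      if (u1.len == 0)
        return 0;                    /* both strings ended together */

      s1 += u1.len;
      s2 += u2.len;
    }
}
```

A string that is a proper prefix of the other compares less here, because the end-of-string unit carries the value 0, which is below every other unit.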

From 88998ee4c04e0458620bf25b9fb251568d9e2eaa Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Sep 2023 14:51:58 -0700
Subject: [PATCH 7/7] Update strings doc

* doc/strings.texi: Mention mbiterf, mbuiterf, mcel, and mcel-prefer.
---
 doc/strings.texi | 88 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 69 insertions(+), 19 deletions(-)

diff --git a/doc/strings.texi b/doc/strings.texi
index 32165ac403..5c901493cc 100644
--- a/doc/strings.texi
+++ b/doc/strings.texi
@@ -26,6 +26,7 @@ in memory of a running C program.
 
 @menu
 * C strings::
+* Iterating through strings::
 * Strings with NUL characters::
 * String Functions in C Locale::
 * Comparison of string APIs::
@@ -176,35 +177,84 @@ Gnulib has functions @func{mbscasecmp}, @func{mbsncasecmp},
 function @code{ulc_casecmp} is preferable to these functions.
 @end itemize
 
-Gnulib also has additional API.
+@cartouche
+@emph{A C string can contain encoding errors.}
+@end cartouche
 
-@menu
-* Iterating through strings::
-@end menu
+Not every NUL-terminated byte sequence represents a valid multibyte
+string.  Byte sequences can contain encoding errors, that is, bytes or
+byte sequences that are invalid and do not represent characters.
+
+String functions like @code{mbscasecmp} and @code{strcoll} whose
+behavior depends on encoding have unspecified behavior on strings
+containing encoding errors, unless the behavior is specifically
+documented.  If an application needs a particular behavior on these
+strings it can iterate through them itself, as described in the next
+subsection.
 
 @node Iterating through strings
-@subsubsection Iterating through strings
+@subsection Iterating through strings
 
-For complex string processing, the provided string functions may not be
-enough, and what you need is a way to iterate through a string while
-processing each (possibly multibyte) character in turn.  Gnulib provides
-two modules for this purpose.  Both iterate through the string in
-forward direction.  Iteration in backward direction, that is, from the
-string's end to start, is not provided, as it is too hairy in general.
+For complex string processing, string functions may not be enough, and
+you need to iterate through a string while processing each (possibly
+multibyte) character or encoding error in turn.  Gnulib has several
+modules for iterating forward through a string in this way.  Backward
+iteration, that is, from the string's end to start, is not provided,
+as it is too hairy in general.
 
 @itemize
 @item
-The @code{mbiter} module.  It iterates through a C string whose length
-is already known.
+The @code{mbiter} module iterates through a string whose length
+is already known.  The string can contain NULs and encoding errors.
+@item
+The @code{mbiterf} module is like @code{mbiter}
+except it is more complex and typically faster.
+@item
+The @code{mbuiter} module iterates through a C string whose length
+is not a-priori known.  The string can contain encoding errors and is
+terminated by the first NUL.
+@item
+The @code{mbuiterf} module is like @code{mbuiter}
+except it is more complex and typically faster.
+@item
+The @code{mcel} module is simpler than @code{mbiter} and @code{mbuiter}
+and can be faster than even @code{mbiterf} and @code{mbuiterf}.
+It can iterate through either strings whose length is known, or
+C strings, or strings terminated by other ASCII characters < 0x30.
 @item
-The @code{mbuiter} module.  It iterates through a C string whose length
-is not a-priori known.
+The @code{mcel-prefer} module is like @code{mcel} except that it
+causes some other modules to be based on @code{mcel} instead of
+on the @code{mbiter} family.
 @end itemize
 
-The @code{mbuiter} module is suitable when there is a high probability
-that only the first few multibyte characters need to be inspected.
-Whereas the @code{mbiter} module is better if usually the iteration runs
-through the entire string.
+The choice of modules depends on the application's needs.  The
+@code{mbiter} module family is more suitable for applications that
+treat some sequences of two or more bytes as a single encoding error,
+and for applications that need to support obsolescent encodings on
+non-GNU platforms, such as CP864, EBCDIC, Johab, and Shift JIS.
+In this module family, @code{mbuiter} and @code{mbuiterf} are more
+suitable than @code{mbiter} and @code{mbiterf} when arguments are C strings,
+lengths are not already known, and it is highly likely that only the
+first few multibyte characters need to be inspected.
+
+The @code{mcel} module is simpler and can be faster than the
+@code{mbiter} family, and is more suitable for applications that do
+not need the @code{mbiter} family's special features.
+
+The @code{mcel-prefer} module is like @code{mcel} except that it also
+causes some other modules, such as @code{mbscasecmp}, to use
+@code{mcel} rather than the @code{mbiter} family.  This can be simpler
+and faster.  However, it does not support the obsolescent encodings,
+and it may behave differently on data containing encoding errors where
+behavior is unspecified or undefined, because in @code{mcel} each
+encoding error is a single byte whereas in the @code{mbiter} family a
+single encoding error can contain two or more bytes.
+
+If a package uses @code{mcel-prefer}, it may also want to give
+@command{gnulib-tool} one or more of the options
+@option{--avoid=mbiter}, @option{--avoid=mbiterf},
+@option{--avoid=mbuiter} and @option{--avoid=mbuiterf},
+to avoid packaging modules that are not needed.
 
 @node Strings with NUL characters
 @subsection Strings with NUL characters
-- 
2.39.2
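
The documentation above distinguishes strings of known length, which may contain NULs, from NUL-terminated C strings, and it notes that with mcel each encoding error is a single byte. A minimal sketch of forward iteration over a known-length buffer under that one-byte-per-error convention follows. It is written against ISO C mbrtowc rather than mcel.h, and the names counts_sketch and count_units_sketch are hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical result type for this sketch.  */
struct counts_sketch
{
  size_t chars;    /* valid characters, including NUL characters */
  size_t errors;   /* encoding errors, each exactly one byte long */
};

/* Walk the LEN bytes starting at BUF from front to back, counting
   characters and one-byte encoding errors.  BUF may contain NULs.
   Assumes a stateless encoding such as UTF-8.  */
static struct counts_sketch
count_units_sketch (char const *buf, size_t len)
{
  struct counts_sketch c = { 0, 0 };
  char const *p = buf;
  char const *lim = buf + len;

  while (p < lim)
    {
      mbstate_t state;
      memset (&state, 0, sizeof state);
      wchar_t wc;
      size_t n = mbrtowc (&wc, p, lim - p, &state);

      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid or truncated sequence: count one error byte.  */
          c.errors++;
          p++;
        }
      else
        {
          /* mbrtowc returns 0 for a NUL character; it is one byte.  */
          c.chars++;
          p += n ? n : 1;
        }
    }

  return c;
}
```

For example, count_units_sketch ("a\0b", 3) reports three characters and no errors, since the embedded NUL is an ordinary character when the length is given explicitly.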
