On 2023-08-07 00:32, Paul Eggert wrote:
> I'll think about naming.
It's been a month and I couldn't think of anything better than to
shorten the name from mbcel to mcel, so I did that and installed the
attached patches. These patches shouldn't affect behavior; that is, they
should only add new functionality.
With luck this should be enough for diffutils to stop using its own
homegrown variants of Gnulib modules; I plan to look into that next.
I'm not entirely happy with this approach, as it means packages like
diffutils will need to pass --avoid=mbuiterf etc. to gnulib-tool if the
packages prefer mcel for everything. If gnulib-tool gave us a way to say
that mbscasecmp depends on mbuiterf or mcel (i.e., "or" instead of
"and") perhaps we could do something better. But the patch should work
as-is for diffutils, and if we come up with something better for Gnulib
we can improve diffutils accordingly.
Although these patches update Gnulib's 'exclude' and 'mbscasecmp'
modules to support mcel-prefer, they don't have similar updates for
other modules like 'mbsncasecmp' and 'propername' that could also use
support. Diffutils doesn't use those other modules so I left them alone
for now; they can be updated later as needed (and by then maybe we'll
have a better solution for the --avoid problem).
From b93de66735cd6f935ee0970f8cb26908d113e09d Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Sep 2023 14:51:55 -0700
Subject: [PATCH 1/7] mcel: new module
* lib/mcel.c, lib/mcel.h, modules/mcel: New files.
---
ChangeLog | 5 +
lib/mcel.c | 3 +
lib/mcel.h | 294 +++++++++++++++++++++++++++++++++++++++++++++++++++
modules/mcel | 34 ++++++
4 files changed, 336 insertions(+)
create mode 100644 lib/mcel.c
create mode 100644 lib/mcel.h
create mode 100644 modules/mcel
diff --git a/ChangeLog b/ChangeLog
index d5fc6c2130..d477347b91 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2023-09-07 Paul Eggert <egg...@cs.ucla.edu>
+
+ mcel: new module
+ * lib/mcel.c, lib/mcel.h, modules/mcel: New files.
+
2023-09-07 Bruno Haible <br...@clisp.org>
Don't use 'throw ()' in C++ 11 or newer.
diff --git a/lib/mcel.c b/lib/mcel.c
new file mode 100644
index 0000000000..3c2ae46290
--- /dev/null
+++ b/lib/mcel.c
@@ -0,0 +1,3 @@
+#include <config.h>
+#define MCEL_INLINE _GL_EXTERN_INLINE
+#include "mcel.h"
diff --git a/lib/mcel.h b/lib/mcel.h
new file mode 100644
index 0000000000..400604f8b2
--- /dev/null
+++ b/lib/mcel.h
@@ -0,0 +1,294 @@
+/* Multi-byte characters, Encoding errors, and Lengths (MCELs)
+ Copyright 2023 Free Software Foundation, Inc.
+
+ This file is free software: you can redistribute it and/or modify
+ it under the terms of the GNU Lesser General Public License as
+ published by the Free Software Foundation; either version 2.1 of the
+ License, or (at your option) any later version.
+
+ This file is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public License
+ along with this program. If not, see <https://www.gnu.org/licenses/>. */
+
+/* Written by Paul Eggert. */
+
+/* The macros in this file implement multi-byte character representation
+ and forward iteration through a multi-byte string.
+ They are simpler and can be faster than the mbiter family.
+ However, they do not support obsolescent encodings like CP864,
+ EBCDIC, Johab, and Shift JIS that glibc also does not support,
+ and it is up to the caller to coalesce encoding-error bytes if desired.
+
+ The mcel_scan function lets code iterate through an array of bytes,
+ supporting character encodings in practical use
+ more simply than using plain mbrtoc32.
+
+ Instead of this single-byte code:
+
+ char *p = ..., *lim = ...;
+ for (; p < lim; p++)
+ process (*p);
+
+ You can use this multi-byte code:
+
+ char *p = ..., *lim = ...;
+ for (mcel_t g; p < lim; p += g.len)
+ {
+ g = mcel_scan (p, lim);
+ process (g);
+ }
+
+ You can select from G using G.ch, G.err, and G.len.
+ G is an encoding error if G.err is nonzero, a character otherwise.
+
+ The mcel_scanz function is similar except it works with a
+ string of unknown but positive length that is terminated with '\0'.
+ Instead of this single-byte code:
+
+ char *p = ...;
+ for (; *p; p++)
+ process (*p);
+
+ You can use this multi-byte code:
+
+ char *p = ...;
+ for (mcel_t g; *p; p += g.len)
+ {
+ g = mcel_scanz (p);
+ process (g);
+ }
+
+ mcel_scant (P, TERMINATOR) is like mcel_scanz (P) except the
+ string is terminated by TERMINATOR. The C standard says that the
+ TERMINATORs '\0', '\r', '\n', '.', '/' are safe, as they cannot be
+ a part (even a trailing byte) of a multi-byte character.
+ In practice TERMINATOR is safe if 0 <= TERMINATOR <= 0x2f (ASCII '/').
+
+ mcel_ch (CH, LEN) and mcel_err (ERR) construct mcel_t values.
+
+ mcel_cmp (G1, G2) compares two mcel_t values lexicographically by
+ character or by encoding byte value, with encoding bytes sorting
+ after characters.
+
+ Calls like c32isalpha (G.ch) test G; they return false for encoding
+ errors since calls like c32isalpha (0) return false. Calls like
+ mcel_tocmp (c32tolower, G1, G2) are like mcel_cmp (G1, G2),
+ but transliterate first.
+
+ Although ISO C and POSIX allow encodings that have shift states or
+ that can produce multiple characters from an indivisible byte sequence,
+ POSIX does not require support for these encodings,
+ they are not in practical use on GNUish platforms,
+ and omitting support for them simplifies the API. */
+
+#ifndef _MCEL_H
+#define _MCEL_H 1
+
+#if !_GL_CONFIG_H_INCLUDED
+ #error "Please include config.h first."
+#endif
+
+#include <verify.h>
+
+#include <limits.h>
+#include <stddef.h>
+#include <uchar.h>
+
+/* Pacify GCC re type limits. */
+#if defined __GNUC__ && 4 < __GNUC__ + (3 <= __GNUC_MINOR__)
+# pragma GCC diagnostic ignored "-Wtype-limits"
+#endif
+
+/* The maximum multi-byte character length supported on any platform.
+ This can be less than MB_LEN_MAX because many platforms have a
+ large MB_LEN_MAX to allow for stateful encodings, and mcel does not
+ support these encodings. MCEL_LEN_MAX is enough for UTF-8, EUC,
+ Shift-JIS, GB18030, etc. In all multi-byte encodings supported by glibc,
+ 0 < MB_CUR_MAX <= MCEL_LEN_MAX <= MB_LEN_MAX. */
+enum { MCEL_LEN_MAX = MB_LEN_MAX < 4 ? MB_LEN_MAX : 4 };
+
+/* Bounds for mcel_t members. */
+enum { MCEL_CHAR_MAX = 0x10FFFF };
+enum { MCEL_ERR_MIN = 0x80 };
+
+/* mcel_t is a type representing a character CH or an encoding error byte ERR,
+ along with a count of the LEN bytes that represent CH or ERR.
+ If ERR is zero, CH is a valid character and 0 < LEN <= MCEL_LEN_MAX;
+ otherwise ERR is an encoding error byte, MCEL_ERR_MIN <= ERR,
+ CH == 0, and LEN == 1. */
+typedef struct
+{
+ char32_t ch;
+ unsigned char err;
+ unsigned char len;
+} mcel_t;
+
+/* Every multi-byte character length fits in mcel_t's LEN. */
+static_assert (MB_LEN_MAX <= UCHAR_MAX);
+
+/* Shifting an encoding error byte left by this value
+ suffices to sort encoding errors after characters. */
+enum { MCEL_ERR_SHIFT = 14 };
+static_assert (MCEL_CHAR_MAX < MCEL_ERR_MIN << MCEL_ERR_SHIFT);
+
+/* Unsigned char promotes to int. */
+static_assert (UCHAR_MAX <= INT_MAX);
+
+/* Bytes have 8 bits, as POSIX requires. */
+static_assert (CHAR_BIT == 8);
+
+#ifndef _GL_LIKELY
+/* Rely on __builtin_expect, as provided by the module 'builtin-expect'. */
+# define _GL_LIKELY(cond) __builtin_expect ((cond), 1)
+# define _GL_UNLIKELY(cond) __builtin_expect ((cond), 0)
+#endif
+
+_GL_INLINE_HEADER_BEGIN
+#ifndef MCEL_INLINE
+# define MCEL_INLINE _GL_INLINE
+#endif
+
+/* mcel_t constructors. */
+MCEL_INLINE mcel_t
+mcel_ch (char32_t ch, size_t len)
+{
+ assume (0 < len);
+ assume (len <= MCEL_LEN_MAX);
+ assume (ch <= MCEL_CHAR_MAX);
+ return (mcel_t) {.ch = ch, .len = len};
+}
+MCEL_INLINE mcel_t
+mcel_err (unsigned char err)
+{
+ assume (MCEL_ERR_MIN <= err);
+ return (mcel_t) {.err = err, .len = 1};
+}
+
+/* Compare C1 and C2, with encoding errors sorting after characters.
+ Return <0, 0, >0 for <, =, >. */
+MCEL_INLINE int
+mcel_cmp (mcel_t c1, mcel_t c2)
+{
+ int ch1 = c1.ch, ch2 = c2.ch;
+ return ((c1.err - c2.err) * (1 << MCEL_ERR_SHIFT)) + (ch1 - ch2);
+}
+
+/* Apply the uchar translator TO to C1 and C2 and compare the results,
+ with encoding errors sorting after characters,
+ Return <0, 0, >0 for <, =, >. */
+MCEL_INLINE int
+mcel_tocmp (wint_t (*to) (wint_t), mcel_t c1, mcel_t c2)
+{
+ int cmp = mcel_cmp (c1, c2);
+ if (_GL_LIKELY ((c1.err - c2.err) | !cmp))
+ return cmp;
+ int ch1 = to (c1.ch), ch2 = to (c2.ch);
+ return ch1 - ch2;
+}
+
+/* Whether C represents itself as a Unicode character
+ when it is the first byte of a single- or multi-byte character.
+ These days it is safe to assume ASCII, so do not support
+ obsolescent encodings like CP864, EBCDIC, Johab, and Shift JIS. */
+MCEL_INLINE bool
+mcel_isbasic (char c)
+{
+ return _GL_LIKELY (0 <= c && c < MCEL_ERR_MIN);
+}
+
+/* With mcel there should be no need for the performance overhead of
+ replacing glibc mbrtoc32, as callers shouldn't care whether the
+ C locale treats a byte with the high bit set as an encoding error. */
+#ifdef __GLIBC__
+# undef mbrtoc32
+#endif
+
+/* Scan bytes from P inclusive to LIM exclusive. P must be less than LIM.
+ Return the character or encoding error starting at P. */
+MCEL_INLINE mcel_t
+mcel_scan (char const *p, char const *lim)
+{
+ /* Handle ASCII quickly to avoid the overhead of calling mbrtoc32.
+ In supported encodings, the first byte of a multi-byte character
+ cannot be an ASCII byte. */
+ char c = *p;
+ if (mcel_isbasic (c))
+ return mcel_ch (c, 1);
+
+ /* An initial mbstate_t; initialization optimized for some platforms.
+ For details about these and other platforms, see wchar.in.h. */
+#if defined __GLIBC__ && 2 < __GLIBC__ + (2 <= __GLIBC_MINOR__)
+ /* Although only a trivial optimization, it's worth it for GNU. */
+ mbstate_t mbs; mbs.__count = 0;
+#elif (defined __FreeBSD__ || defined __DragonFly__ || defined __OpenBSD__ \
+ || (defined __APPLE__ && defined __MACH__))
+ /* These platforms have 128-byte mbstate_t. What were they thinking?
+ Initialize just for supported encodings (UTF-8, EUC, etc.).
+ Avoid memset because some compilers generate function call code. */
+ struct mbhidden { char32_t ch; int utf8_want, euc_want; }
+ _GL_ATTRIBUTE_MAY_ALIAS;
+ union { mbstate_t m; struct mbhidden s; } u;
+ u.s.ch = u.s.utf8_want = u.s.euc_want = 0;
+# define mbs u.m
+#elif defined __NetBSD__
+ /* Experiments on both 32- and 64-bit NetBSD platforms have
+ shown that it doesn't work to clear fewer than 24 bytes. */
+ struct mbhidden { long long int a, b, c; } _GL_ATTRIBUTE_MAY_ALIAS;
+ union { mbstate_t m; struct mbhidden s; } u;
+ u.s.a = u.s.b = u.s.c = 0;
+# define mbs u.m
+#else
+ /* mbstate_t has unknown structure or is not worth optimizing. */
+ mbstate_t mbs = {0};
+#endif
+
+ char32_t ch;
+ size_t len = mbrtoc32 (&ch, p, lim - p, &mbs);
+
+ /* Any LEN with top bit set is an encoding error, as LEN == (size_t) -3
+ is not supported and MB_LEN_MAX is small. */
+ if (_GL_UNLIKELY ((size_t) -1 / 2 < len))
+ return mcel_err (c);
+
+ /* A multi-byte character. LEN must be positive,
+ as *P != '\0' and shift sequences are not supported. */
+ return mcel_ch (ch, len);
+}
+
+/* Scan bytes from P, a byte sequence terminated by TERMINATOR.
+ If *P == TERMINATOR, scan just that byte; otherwise scan
+ bytes up to but not including TERMINATOR.
+ TERMINATOR must be ASCII, and should be '\0', '\r', '\n', '.', or '/'.
+ Return the character or encoding error starting at P. */
+MCEL_INLINE mcel_t
+mcel_scant (char const *p, char terminator)
+{
+ /* Handle ASCII quickly for speed. */
+ if (mcel_isbasic (*p))
+ return mcel_ch (*p, 1);
+
+ /* Defer to mcel_scan for non-ASCII. Compute length with code that
+ is typically faster than strnlen. */
+ char const *lim = p + 1;
+ for (int i = 0; i < MCEL_LEN_MAX - 1; i++)
+ lim += *lim != terminator;
+ return mcel_scan (p, lim);
+}
+
+/* Scan bytes from P, a byte sequence terminated by '\0'.
+ If *P == '\0', scan just that byte; otherwise scan
+ bytes up to but not including '\0'.
+ Return the character or encoding error starting at P. */
+MCEL_INLINE mcel_t
+mcel_scanz (char const *p)
+{
+ return mcel_scant (p, '\0');
+}
+
+_GL_INLINE_HEADER_END
+
+#endif /* _MCEL_H */
diff --git a/modules/mcel b/modules/mcel
new file mode 100644
index 0000000000..59ca633641
--- /dev/null
+++ b/modules/mcel
@@ -0,0 +1,34 @@
+Description:
+Multibyte Characters, Encoding errors, and Lengths
+
+Files:
+lib/mcel.c
+lib/mcel.h
+
+Depends-on:
+assert-h
+extern-inline
+limits-h
+mbrtoc32
+stdbool
+uchar
+verify
+
+configure.ac:
+
+Makefile.am:
+lib_SOURCES += mcel.c mcel.h
+
+Include:
+"mcel.h"
+
+Link:
+$(LTLIBUNISTRING) when linking with libtool, $(LIBUNISTRING) otherwise
+$(MBRTOWC_LIB)
+$(LTLIBC32CONV) when linking with libtool, $(LIBC32CONV) otherwise
+
+License:
+LGPLv2+
+
+Maintainer:
+all
--
2.39.2
From 988b7b2f88972e26d7b828c3f0925d50c2fb354e Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Sep 2023 14:51:56 -0700
Subject: [PATCH 2/7] mcel-tests: new module
* modules/mcel-tests, tests/test-mcel.c: New files
---
ChangeLog | 3 +
modules/mcel-tests | 12 ++++
tests/test-mcel.c | 137 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 152 insertions(+)
create mode 100644 modules/mcel-tests
create mode 100644 tests/test-mcel.c
diff --git a/ChangeLog b/ChangeLog
index d477347b91..1b10dda6a9 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,8 @@
2023-09-07 Paul Eggert <egg...@cs.ucla.edu>
+ mcel-tests: new module
+ * modules/mcel-tests, tests/test-mcel.c: New files
+
mcel: new module
* lib/mcel.c, lib/mcel.h, modules/mcel: New files.
diff --git a/modules/mcel-tests b/modules/mcel-tests
new file mode 100644
index 0000000000..4b9ba0eeaf
--- /dev/null
+++ b/modules/mcel-tests
@@ -0,0 +1,12 @@
+Files:
+tests/test-mcel.c
+
+Depends-on:
+assert-h
+setlocale
+
+configure.ac:
+
+Makefile.am:
+TESTS += test-mcel
+check_PROGRAMS += test-mcel
diff --git a/tests/test-mcel.c b/tests/test-mcel.c
new file mode 100644
index 0000000000..2977ec06a0
--- /dev/null
+++ b/tests/test-mcel.c
@@ -0,0 +1,137 @@
+/* Test <mcel.h>
+ Copyright 2023 Free Software Foundation, Inc.
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <https://www.gnu.org/licenses/>. */
+
+#include <config.h>
+
+#include <mcel.h>
+
+#include <locale.h>
+
+#include "macros.h"
+
+static wint_t
+to_ascii (wint_t c)
+{
+ return c & 0x7f;
+}
+
+static int
+sgn (int i)
+{
+ return (i > 0) - (i < 0);
+}
+
+static void
+test_mcel_vs_mbrtoc32 (unsigned char uc, mcel_t c, size_t n, char32_t ch)
+{
+ ASSERT (!c.err == (n <= MB_LEN_MAX));
+ ASSERT (c.err
+ ? c.err == uc && c.ch == 0 && c.len == 1
+ : c.ch == ch && c.len == (n ? n : 1));
+}
+
+int
+main (void)
+{
+ /* configure should already have checked that the locale is supported. */
+ if (setlocale (LC_ALL, "") == NULL)
+ return 1;
+
+ mcel_t prev;
+ for (int ch = 0; ch < 0x80; ch++)
+ {
+ mcel_t c = mcel_ch (ch, 1);
+ ASSERT (c.ch == ch);
+ ASSERT (c.len == 1);
+ ASSERT (!c.err);
+ ASSERT (mcel_cmp (c, c) == 0);
+ ASSERT (mcel_tocmp (to_ascii, c, c) == 0);
+ if (ch)
+ {
+ ASSERT (mcel_cmp (prev, c) < 0);
+ ASSERT (mcel_cmp (c, prev) > 0);
+ ASSERT (mcel_tocmp (to_ascii, prev, c) < 0);
+ ASSERT (mcel_tocmp (to_ascii, c, prev) > 0);
+ }
+ ASSERT (mcel_isbasic (ch));
+ prev = c;
+ }
+ for (char ch = CHAR_MIN; ; ch++)
+ {
+ ASSERT (mcel_isbasic (ch) == (0 <= ch && ch <= 0x7f));
+ if (ch == CHAR_MAX)
+ break;
+ }
+ for (int ch = 0x80; ch < 0x200; ch++)
+ {
+ mcel_t c = mcel_ch (ch, 2);
+ ASSERT (c.ch == ch);
+ ASSERT (c.len == 2);
+ ASSERT (!c.err);
+ ASSERT (mcel_cmp (c, c) == 0);
+ ASSERT (mcel_tocmp (to_ascii, c, c) == 0);
+ ASSERT (mcel_cmp (prev, c) < 0);
+ ASSERT (mcel_cmp (c, prev) > 0);
+ ASSERT (mcel_tocmp (to_ascii, c, c) == 0);
+ int cmp = to_ascii (c.ch) ? -1 : 1;
+ ASSERT (sgn (mcel_tocmp (to_ascii, prev, c)) == cmp);
+ ASSERT (sgn (mcel_tocmp (to_ascii, c, prev)) == -cmp);
+ prev = c;
+ }
+ for (unsigned char err = 0x80; ; err++)
+ {
+ mcel_t c = mcel_err (err);
+ ASSERT (!c.ch);
+ ASSERT (c.len == 1);
+ ASSERT (c.err == err);
+ ASSERT (mcel_cmp (c, c) == 0);
+ ASSERT (mcel_cmp (prev, c) < 0);
+ ASSERT (mcel_cmp (c, prev) > 0);
+ ASSERT (mcel_tocmp (to_ascii, c, c) == 0);
+ ASSERT (mcel_tocmp (to_ascii, prev, c) < 0);
+ ASSERT (mcel_tocmp (to_ascii, c, prev) > 0);
+ prev = c;
+ if (err == (unsigned char) -1)
+ break;
+ }
+
+ for (int i = CHAR_MIN; i <= CHAR_MAX; i++)
+ for (int j = CHAR_MIN; j <= CHAR_MAX; j++)
+ for (int k = CHAR_MIN; k <= CHAR_MAX; k++)
+ {
+ char const ijk[] = {i, j, k};
+ mbstate_t mbs = {0};
+ char32_t ch;
+ size_t n = mbrtoc32 (&ch, ijk, sizeof ijk, &mbs);
+ mcel_t c = mcel_scan (ijk, ijk + sizeof ijk);
+ test_mcel_vs_mbrtoc32 (i, c, n, ch);
+
+ static char const terminator[] = "\r\n./";
+ for (int ti = 0; ti < sizeof terminator; ti++)
+ {
+ char t = terminator[ti];
+ if (i == t)
+ continue;
+ mcel_t d = mcel_scant (ijk, t);
+ ASSERT (c.ch == d.ch && c.err == d.err && c.len == d.len);
+ if (!t)
+ {
+ mcel_t z = mcel_scanz (ijk);
+ ASSERT (d.ch == z.ch && d.err == z.err && d.len == z.len);
+ }
+ }
+ }
+}
--
2.39.2
From d9ad9a68fd418286bcaf0b4c71c3ae2fc63a09c5 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Sep 2023 14:51:56 -0700
Subject: [PATCH 3/7] mcel-bench-tests: new module
* modules/mcel-bench-tests, tests/bench-mcel.c: New files.
* tests/bench-multibyte.h (TEXT_LATIN_ASCII_LINE1)
(TEXT_FRENCH_UTF8_LINE1, TEXT_GREEK_UTF8_LINE1)
(TEXT_CHINESE_UTF8_LINE1): New macros.
(text_random_bytes): New constant.
* tests/bench.h (timing_output): Mark with _GL_UNUSED,
since bench-mcel.c does not use it.
---
ChangeLog | 9 +
modules/mcel-bench-tests | 23 +++
tests/bench-mcel.c | 369 +++++++++++++++++++++++++++++++++++++++
tests/bench-multibyte.h | 139 +++++++++++++++
tests/bench.h | 2 +-
5 files changed, 541 insertions(+), 1 deletion(-)
create mode 100644 modules/mcel-bench-tests
create mode 100644 tests/bench-mcel.c
diff --git a/ChangeLog b/ChangeLog
index 1b10dda6a9..cbb1979acb 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,14 @@
2023-09-07 Paul Eggert <egg...@cs.ucla.edu>
+ mcel-bench-tests: new module
+ * modules/mcel-bench-tests, tests/bench-mcel.c: New files.
+ * tests/bench-multibyte.h (TEXT_LATIN_ASCII_LINE1)
+ (TEXT_FRENCH_UTF8_LINE1, TEXT_GREEK_UTF8_LINE1)
+ (TEXT_CHINESE_UTF8_LINE1): New macros.
+ (text_random_bytes): New constant.
+ * tests/bench.h (timing_output): Mark with _GL_UNUSED,
+ since bench-mcel.c does not use it.
+
mcel-tests: new module
* modules/mcel-tests, tests/test-mcel.c: New files
diff --git a/modules/mcel-bench-tests b/modules/mcel-bench-tests
new file mode 100644
index 0000000000..ea64a2f60c
--- /dev/null
+++ b/modules/mcel-bench-tests
@@ -0,0 +1,23 @@
+Files:
+tests/bench-mcel.c
+tests/bench-multibyte.h
+tests/bench.h
+
+Depends-on:
+mbiter
+mbiterf
+mbrtoc32-regular
+mbuiter
+mbuiterf
+mcel
+setlocale
+striconv
+getrusage
+gettimeofday
+
+configure.ac:
+
+Makefile.am:
+noinst_PROGRAMS += bench-mcel
+bench_mcel_CPPFLAGS = $(AM_CPPFLAGS) -DNDEBUG
+bench_mcel_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV)
diff --git a/tests/bench-mcel.c b/tests/bench-mcel.c
new file mode 100644
index 0000000000..3fbfe122c9
--- /dev/null
+++ b/tests/bench-mcel.c
@@ -0,0 +1,369 @@
+/* Benchmark mcel and some alternatives
+ Copyright 2023 Free Software Foundation, Inc.
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <https://www.gnu.org/licenses/>. */
+
+#include <config.h>
+
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <locale.h>
+#include <uchar.h>
+
+#include "bench.h"
+#include "bench-multibyte.h"
+#include "mbiter.h"
+#include "mbiterf.h"
+#include "mbuiter.h"
+#include "mbuiterf.h"
+#include "mcel.h"
+
+typedef unsigned long long (*test_function) (char const *, char const *, int);
+
+static unsigned long long
+noop_test (char const *text, char const *text_end, int repeat)
+{
+ unsigned long long sum = 0;
+
+ for (int count = 0; count < repeat; count++)
+ {
+ char const *iter;
+ for (iter = text; iter < text_end; iter++)
+ sum += (uintptr_t) iter;
+ }
+
+ return sum;
+}
+
+static unsigned long long
+single_test (char const *text, char const *text_end, int repeat)
+{
+ unsigned long long sum = 0;
+
+ for (int count = 0; count < repeat; count++)
+ for (char const *iter = text; iter < text_end; )
+ {
+ unsigned char c = *iter++;
+ sum += c;
+ }
+
+ return sum;
+}
+
+static unsigned long long
+mbiter_test (char const *text, char const *text_end, int repeat)
+{
+ unsigned long long sum = 0;
+
+ size_t text_len = text_end - text;
+ for (int count = 0; count < repeat; count++)
+ {
+ mbi_iterator_t iter;
+ for (mbi_init (iter, text, text_len); mbi_avail (iter); )
+ {
+ mbchar_t cur = mbi_cur (iter);
+ mbi_advance (iter);
+ sum += cur.wc_valid ? cur.wc : (unsigned char) *mb_ptr (cur) << 16;
+ }
+ }
+
+ return sum;
+}
+
+static unsigned long long
+mbiterf_test (char const *text, char const *text_end, int repeat)
+{
+ unsigned long long sum = 0;
+
+ for (int count = 0; count < repeat; count++)
+ {
+ mbif_state_t state;
+ char const *iter;
+ for (mbif_init (state), iter = text; mbif_avail (state, iter, text_end); )
+ {
+ mbchar_t cur = mbif_next (state, iter, text_end);
+ iter += mb_len (cur);
+ sum += cur.wc_valid ? cur.wc : (unsigned char) *mb_ptr (cur) << 16;
+ }
+ }
+
+ return sum;
+}
+
+static unsigned long long
+mbuiter_test (char const *text, char const *text_end, int repeat)
+{
+ unsigned long long sum = 0;
+
+ for (int count = 0; count < repeat; count++)
+ for (char const *t = text; t < text_end; t++)
+ {
+ mbui_iterator_t iter;
+ for (mbui_init (iter, t); mbui_avail (iter); )
+ {
+ mbchar_t cur = mbui_cur (iter);
+ mbui_advance (iter);
+ sum += cur.wc_valid ? cur.wc : (unsigned char) *mb_ptr (cur) << 16;
+ }
+ t = mbui_cur_ptr (iter);
+ }
+
+ return sum;
+}
+
+static unsigned long long
+mbuiterf_test (char const *text, _GL_UNUSED char const *text_end, int repeat)
+{
+ unsigned long long sum = 0;
+
+ for (int count = 0; count < repeat; count++)
+ for (char const *t = text; t < text_end; t++)
+ {
+ mbuif_state_t state;
+ char const *iter;
+ for (mbuif_init (state), iter = t; mbuif_avail (state, iter); )
+ {
+ mbchar_t cur = mbuif_next (state, iter);
+ iter += mb_len (cur);
+ sum += cur.wc_valid ? cur.wc : (unsigned char) *mb_ptr (cur) << 16;
+ }
+ t = iter;
+ }
+
+ return sum;
+}
+
+static unsigned long long
+mcel_test (char const *text, char const *text_end, int repeat)
+{
+ unsigned long long sum = 0;
+
+ for (int count = 0; count < repeat; count++)
+ for (char const *iter = text; iter < text_end; )
+ {
+ mcel_t g = mcel_scan (iter, text_end);
+ iter += g.len;
+ sum += g.ch | (g.err << 16);
+ }
+
+ return sum;
+}
+
+static unsigned long long
+mcuel_test (char const *text, char const *text_end, int repeat)
+{
+ unsigned long long sum = 0;
+
+ for (int count = 0; count < repeat; count++)
+ for (char const *t = text; t < text_end; t++)
+ {
+ char const *iter = t;
+ while (*iter)
+ {
+ mcel_t g = mcel_scanz (iter);
+ iter += g.len;
+ sum += g.ch | (g.err << 16);
+ }
+ t = iter;
+ }
+
+ return sum;
+}
+
+static unsigned long long
+do_1_test (test_function test, char const *text,
+ char const *text_end, int repeat, struct timings_state *ts)
+{
+ timing_start (ts);
+ unsigned long long sum = test (text, text_end, repeat);
+ timing_end (ts);
+ return sum;
+}
+
+static void
+do_test (char test, int repeat, char const *locale_name,
+ char const *text, size_t text_len)
+{
+ if (setlocale (LC_ALL, locale_name) != NULL)
+ {
+ char const *text_end = text + text_len;
+
+ static struct
+ {
+ char const *name;
+ test_function fn;
+ struct timings_state ts;
+ unsigned long long volatile sum;
+ } testdesc[] = {
+ { "noop", noop_test },
+ { "single", single_test },
+ { "mbiter", mbiter_test },
+ { "mbiterf", mbiterf_test },
+ { "mbuiter", mbuiter_test },
+ { "mbuiterf", mbuiterf_test },
+ { "mcel", mcel_test },
+ { "mcuel", mcuel_test },
+ };
+ int ntestdesc = sizeof testdesc / sizeof *testdesc;
+ for (int i = 0; i < ntestdesc; i++)
+ testdesc[i].sum =
+ do_1_test (testdesc[i].fn, text, text_end, repeat, &testdesc[i].ts);
+
+ setlocale (LC_ALL, "C");
+
+ static bool header_printed;
+ if (!header_printed)
+ {
+ printf (" ");
+ for (int i = 0; i < ntestdesc; i++)
+ printf (" %8s", testdesc[i].name);
+ printf ("\n");
+ header_printed = true;
+ }
+
+ printf ("%c", test);
+ for (int i = 0; i < ntestdesc; i++)
+ {
+ double user_usec = testdesc[i].ts.user_usec;
+ double sys_usec = testdesc[i].ts.sys_usec;
+ printf (" %8.3f", (user_usec + sys_usec) / 1e6);
+ }
+ printf ("\n");
+ }
+ else
+ {
+ printf ("Skipping test: locale %s not installed.\n", locale_name);
+ }
+}
+
+/* Performs some or all of the following tests:
+ A - ASCII text, C locale
+ B - ASCII text, UTF-8 locale
+ C - French text, C locale
+ D - French text, ISO-8859-1 locale
+ E - French text, UTF-8 locale
+ F - Greek text, C locale
+ G - Greek text, ISO-8859-7 locale
+ H - Greek text, UTF-8 locale
+ I - Chinese text, UTF-8 locale
+ J - Chinese text, GB18030 locale
+ K - Random bytes, C locale
+ L - Random bytes, UTF-8 locale
+ a - short ASCII text, C locale
+ b - short ASCII text, UTF-8 locale
+ e - short French text, UTF-8 locale
+ h - short Greek text, UTF-8 locale
+ i - short Chinese text, UTF-8 locale
+ Pass the tests to be performed as first argument. */
+int
+main (int argc, char *argv[])
+{
+ if (argc != 3)
+ {
+ fprintf (stderr, "Usage: %s TESTS REPETITIONS\n", argv[0]);
+
+ fprintf (stderr, "Example: %s ABCDEFGHIJKabehi 100000\n", argv[0]);
+ exit (1);
+ }
+
+ char const *tests = argv[1];
+ int repeat = atoi (argv[2]);
+
+ text_init ();
+
+ /* Execute each test. */
+ size_t i;
+ for (i = 0; i < strlen (tests); i++)
+ {
+ char test = tests[i];
+
+ switch (test)
+ {
+ case 'A':
+ do_test (test, repeat, "C", text_latin_ascii,
+ strlen (text_latin_ascii));
+ break;
+ case 'a':
+ do_test (test, repeat, "C", TEXT_LATIN_ASCII_LINE1,
+ strlen (TEXT_LATIN_ASCII_LINE1));
+ break;
+ case 'B':
+ do_test (test, repeat, "en_US.UTF-8", text_latin_ascii,
+ strlen (text_latin_ascii));
+ break;
+ case 'b':
+ do_test (test, repeat, "en_US.UTF-8", TEXT_LATIN_ASCII_LINE1,
+ strlen (TEXT_LATIN_ASCII_LINE1));
+ break;
+ case 'C':
+ do_test (test, repeat, "C", text_french_iso8859,
+ strlen (text_french_iso8859));
+ break;
+ case 'D':
+ do_test (test, repeat, "fr_FR.ISO-8859-1", text_french_iso8859,
+ strlen (text_french_iso8859));
+ break;
+ case 'E':
+ do_test (test, repeat, "en_US.UTF-8", text_french_utf8,
+ strlen (text_french_utf8));
+ break;
+ case 'e':
+ do_test (test, repeat, "en_US.UTF-8", TEXT_FRENCH_UTF8_LINE1,
+ strlen (TEXT_FRENCH_UTF8_LINE1));
+ break;
+ case 'F':
+ do_test (test, repeat, "C", text_greek_iso8859,
+ strlen (text_greek_iso8859));
+ break;
+ case 'G':
+ do_test (test, repeat, "el_GR.ISO-8859-7", text_greek_iso8859,
+ strlen (text_greek_iso8859));
+ break;
+ case 'H':
+ do_test (test, repeat, "en_US.UTF-8", text_greek_utf8,
+ strlen (text_greek_utf8));
+ break;
+ case 'h':
+ do_test (test, repeat, "en_US.UTF-8", TEXT_GREEK_UTF8_LINE1,
+ strlen (TEXT_GREEK_UTF8_LINE1));
+ break;
+ case 'I':
+ do_test (test, repeat, "en_US.UTF-8", text_chinese_utf8,
+ strlen (text_chinese_utf8));
+ break;
+ case 'i':
+ do_test (test, repeat, "en_US.UTF-8", TEXT_CHINESE_UTF8_LINE1,
+ strlen (TEXT_CHINESE_UTF8_LINE1));
+ break;
+ case 'J':
+ do_test (test, repeat, "zh_CN.GB18030", text_chinese_gb18030,
+ strlen (text_chinese_gb18030));
+ break;
+ case 'K':
+ do_test (test, repeat, "C", text_random_bytes,
+ sizeof text_random_bytes - 1);
+ break;
+ case 'L':
+ do_test (test, repeat, "en_US.UTF-8", text_random_bytes,
+ sizeof text_random_bytes - 1);
+ break;
+ default:
+ /* Ignore. */
+ ;
+ }
+ }
+
+ return 0;
+}
diff --git a/tests/bench-multibyte.h b/tests/bench-multibyte.h
index d1aec951a0..6e475ada15 100644
--- a/tests/bench-multibyte.h
+++ b/tests/bench-multibyte.h
@@ -21,7 +21,9 @@
Liber I, Sermo IX
*/
static char const text_latin_ascii[] =
+#define TEXT_LATIN_ASCII_LINE1 \
"ibam forte via sacra, sicut meus est mos,\n"
+ TEXT_LATIN_ASCII_LINE1
"nescio quid meditans nugarum, totus in illis:\n"
"accurrit quidam notus mihi nomine tantum\n"
"arreptaque manu 'quid agis, dulcissime rerum?'\n"
@@ -102,7 +104,9 @@ static char const text_latin_ascii[] =
;
static char const text_french_utf8[] =
+#define TEXT_FRENCH_UTF8_LINE1 \
"J'errais par hasard sur une voie sacrée, comme c'est ma coutume,\n"
+ TEXT_FRENCH_UTF8_LINE1
"Méditant je ne sais quoi de frivole, totalement absorbé par ces pensées :\n"
"Arrive soudain quelqu'un de connu, seulement par son nom,\n"
"Et me saisissant par la main, il dit : « Comment vas-tu, ô douceur des choses ? »\n"
@@ -185,7 +189,9 @@ static char const text_french_utf8[] =
static char const *text_french_iso8859;
static char const text_greek_utf8[] =
+#define TEXT_GREEK_UTF8_LINE1 \
"περιπάτων μέντοι κατά την ιερή οδό, όπως είναι η συνήθειά μου,\n"
+ TEXT_GREEK_UTF8_LINE1
"σκεφτόμενος άσχημα πράγματα, πλήρως αφοσιωμένος σε αυτά:\n"
"έρχεται ένας γνωστός με όνομα μόνον,\n"
"και αρπάζοντας το χέρι μου, λέει \"τι κάνεις, πιο γλυκέ των πραγμάτων;\"\n"
@@ -261,7 +267,9 @@ static char const text_greek_utf8[] =
static char const *text_greek_iso8859;
static char const text_chinese_utf8[] =
+#define TEXT_CHINESE_UTF8_LINE1 \
"我偶然走在圣路上,正如我的习惯,\n"
+ TEXT_CHINESE_UTF8_LINE1
"心里想着一些无聊的事情,全神贯注其中:\n"
"突然有个熟人从我身边跑过,只知道我的名字,\n"
"他一把抓住我的手说:“你好,最甜蜜的人!”\n"
@@ -337,6 +345,137 @@ static char const text_chinese_utf8[] =
static char const *text_chinese_gb18030;
+/* 2000 random bytes (including NUL bytes) followed by NUL. Generated by:
+ od -An -N2000 -to1 /dev/urandom | sed 's/ *''/\\/g; s/.*''/ "&"/'
+ in the C locale. */
+static char const text_random_bytes[] =
+ "\002\025\262\356\251\052\313\037\234\000\160\247\162\250\011\140"
+ "\212\121\014\223\070\256\312\363\204\362\130\226\374\256\365\364"
+ "\173\131\373\270\066\034\021\216\072\021\050\250\106\146\167\327"
+ "\031\301\160\324\346\334\250\111\066\377\315\004\355\167\225\176"
+ "\257\070\334\005\354\337\320\037\272\172\156\042\312\077\134\217"
+ "\116\240\022\232\014\244\225\114\354\204\224\212\130\062\360\312"
+ "\076\323\154\332\127\230\050\377\263\165\346\371\244\070\140\120"
+ "\371\313\311\232\256\244\150\003\320\132\045\257\001\112\057\264"
+ "\111\334\370\033\022\246\347\224\032\112\130\166\263\140\204\310"
+ "\323\315\214\265\313\172\275\100\020\311\215\207\061\031\000\101"
+ "\132\044\050\020\372\003\011\347\135\120\026\367\376\213\336\061"
+ "\117\223\005\217\330\217\227\121\134\011\353\137\247\255\000\353"
+ "\376\147\004\152\261\306\106\341\364\355\067\047\261\167\076\066"
+ "\102\353\026\203\165\226\245\270\036\222\003\134\112\200\375\314"
+ "\204\023\351\021\240\123\211\165\103\210\100\030\377\205\162\307"
+ "\027\024\342\231\216\121\113\243\151\243\045\237\351\346\016\320"
+ "\374\127\314\272\226\371\072\030\134\021\311\202\252\060\263\305"
+ "\262\261\043\065\341\265\364\225\047\140\347\025\073\054\060\053"
+ "\345\202\031\234\246\201\164\313\251\076\022\214\121\331\376\160"
+ "\237\145\122\264\214\073\277\254\020\020\322\030\006\221\261\355"
+ "\366\023\162\326\137\147\137\132\005\223\312\123\103\330\127\341"
+ "\207\240\175\036\277\075\213\255\031\223\366\060\350\361\271\122"
+ "\274\145\174\030\333\230\077\323\104\031\062\374\077\345\276\154"
+ "\224\006\346\376\101\040\156\060\227\172\336\156\305\050\225\236"
+ "\207\233\253\232\062\021\003\110\035\266\315\342\114\162\126\050"
+ "\146\216\165\345\125\061\137\350\307\236\205\350\026\221\267\305"
+ "\051\115\130\050\103\141\077\251\131\326\262\232\164\060\056\165"
+ "\152\027\145\144\323\030\065\247\321\317\153\316\363\232\271\222"
+ "\372\012\223\256\064\354\243\305\002\333\075\143\366\214\270\016"
+ "\041\320\336\070\250\070\354\354\373\157\365\204\122\215\131\246"
+ "\176\147\122\221\101\331\366\001\325\354\271\227\010\152\050\060"
+ "\011\254\317\037\107\024\374\127\042\250\012\123\355\216\207\012"
+ "\210\007\252\043\244\023\125\142\246\250\325\275\136\247\260\177"
+ "\363\063\063\315\263\134\134\347\373\005\001\373\354\274\302\177"
+ "\253\343\324\031\050\126\371\251\146\224\276\374\100\054\165\011"
+ "\040\032\243\014\320\030\237\111\065\353\043\057\141\343\256\265"
+ "\134\221\214\250\242\171\056\277\146\370\031\057\334\352\235\154"
+ "\240\233\027\106\206\317\237\236\356\325\241\272\064\137\227\263"
+ "\371\043\003\327\117\320\026\313\323\244\077\174\067\273\136\213"
+ "\370\057\170\024\266\046\075\045\234\257\311\230\216\303\367\357"
+ "\217\021\312\241\210\323\341\220\331\017\354\113\054\171\377\007"
+ "\341\171\157\145\371\025\005\112\137\241\271\352\156\161\107\231"
+ "\006\365\331\020\023\366\337\336\341\352\014\213\045\337\206\032"
+ "\116\230\206\000\353\074\311\240\102\004\124\251\212\261\336\322"
+ "\251\344\347\040\205\025\267\324\315\142\164\366\330\047\066\122"
+ "\205\270\200\316\142\252\351\246\350\122\217\336\222\266\350\124"
+ "\350\370\170\360\256\066\206\043\175\335\054\037\112\131\166\266"
+ "\245\054\221\370\370\344\310\332\006\253\317\071\161\310\243\035"
+ "\367\212\233\274\043\331\140\212\353\017\022\277\162\020\027\356"
+ "\130\040\140\350\016\205\311\156\102\144\250\100\123\334\374\300"
+ "\171\353\317\273\126\204\016\200\346\155\172\016\047\357\135\277"
+ "\045\216\276\214\017\202\231\000\377\176\005\043\301\277\274\052"
+ "\223\101\127\212\260\123\011\051\067\110\330\322\061\272\225\127"
+ "\061\011\031\305\043\243\352\376\376\257\035\050\304\267\174\177"
+ "\021\171\220\356\004\166\036\307\044\005\305\266\136\042\156\043"
+ "\240\226\115\021\202\020\354\011\042\355\156\237\323\006\164\317"
+ "\054\212\330\361\373\114\324\325\136\041\367\024\025\247\330\207"
+ "\136\075\004\067\220\036\034\231\166\135\066\366\041\061\055\256"
+ "\370\340\323\026\234\333\356\076\174\267\124\104\265\050\035\061"
+ "\102\052\157\034\167\217\362\031\064\035\313\276\334\317\223\363"
+ "\166\306\004\341\010\204\216\066\325\073\170\000\024\263\366\116"
+ "\336\346\023\244\242\334\002\376\257\266\312\004\275\047\272\161"
+ "\317\373\307\146\252\150\007\072\363\014\166\204\370\303\211\044"
+ "\352\152\076\226\017\252\223\054\126\111\121\315\301\305\111\117"
+ "\375\204\120\337\152\125\153\320\012\224\344\051\351\227\164\226"
+ "\226\243\167\254\133\133\377\125\101\223\375\147\231\074\307\312"
+ "\025\324\156\321\144\154\311\262\241\351\374\341\204\050\343\132"
+ "\271\015\176\147\117\354\312\244\360\276\027\142\261\351\321\337"
+ "\347\065\353\242\145\253\242\270\347\175\330\070\113\030\307\372"
+ "\032\215\304\335\364\062\051\265\040\260\175\312\301\356\164\140"
+ "\003\071\305\371\374\270\135\370\324\001\114\011\035\131\036\300"
+ "\162\117\245\007\311\240\317\230\166\114\140\054\175\256\026\176"
+ "\107\341\217\105\032\142\117\036\127\230\355\223\117\272\003\104"
+ "\143\041\140\146\046\004\305\376\047\174\334\033\271\156\026\057"
+ "\306\363\160\323\156\151\347\347\232\052\111\321\315\337\174\023"
+ "\363\312\044\102\310\116\217\226\052\367\344\342\144\235\146\303"
+ "\044\351\375\212\237\052\071\335\173\042\057\105\141\101\164\175"
+ "\116\366\040\134\075\125\312\233\216\262\345\362\270\274\111\017"
+ "\134\272\126\234\053\066\300\026\151\274\317\213\022\202\372\245"
+ "\134\056\000\300\152\265\346\207\272\061\135\055\257\134\100\260"
+ "\015\056\015\004\365\365\040\031\047\225\326\135\252\031\071\111"
+ "\203\032\174\102\072\013\012\135\152\125\060\201\111\074\112\235"
+ "\053\266\077\360\302\226\161\247\005\216\050\122\307\064\112\257"
+ "\131\017\354\101\031\226\200\040\353\015\303\073\007\025\376\163"
+ "\174\050\125\135\366\064\051\044\306\003\271\057\023\300\265\254"
+ "\202\110\171\241\150\207\153\266\304\336\265\052\151\123\345\215"
+ "\061\314\310\344\364\161\266\052\010\372\067\007\010\217\121\067"
+ "\325\145\350\037\066\333\165\142\233\137\016\313\202\162\354\341"
+ "\241\126\337\361\107\062\106\140\222\027\022\260\305\156\206\200"
+ "\250\251\347\040\115\157\314\114\336\030\256\134\142\270\340\242"
+ "\137\215\317\075\112\073\134\007\353\325\310\036\367\050\102\263"
+ "\302\113\311\203\254\152\157\216\314\016\140\143\172\154\150\217"
+ "\054\024\376\121\336\204\144\210\311\073\204\160\316\363\113\021"
+ "\020\142\251\251\053\372\336\220\107\152\011\064\120\063\017\154"
+ "\132\266\012\376\171\174\176\230\053\272\104\055\201\103\234\000"
+ "\171\044\000\102\211\365\157\330\321\357\065\025\146\127\212\035"
+ "\217\224\101\307\016\221\131\231\266\264\166\245\142\032\141\051"
+ "\156\110\253\076\054\322\316\007\274\270\240\131\151\131\037\367"
+ "\317\162\313\021\367\224\214\301\070\346\121\242\016\215\337\317"
+ "\015\030\221\375\156\335\113\173\037\126\056\045\222\250\236\117"
+ "\116\230\366\245\210\007\144\001\326\344\350\051\226\215\266\350"
+ "\247\231\021\121\057\167\341\307\211\336\362\244\006\124\321\120"
+ "\375\211\074\271\174\166\304\347\164\061\072\254\104\332\373\266"
+ "\042\140\031\277\366\171\245\076\003\113\105\044\327\356\053\122"
+ "\017\056\317\117\244\133\217\012\357\004\327\236\362\170\266\112"
+ "\162\171\276\042\226\145\365\013\360\167\067\363\234\212\211\146"
+ "\135\334\126\005\351\317\173\060\311\074\126\162\072\217\141\227"
+ "\311\013\074\321\176\371\103\206\266\340\004\046\070\150\356\173"
+ "\351\024\322\073\007\207\345\372\344\242\074\006\266\337\214\117"
+ "\172\253\340\063\120\052\236\260\150\252\161\334\047\145\154\204"
+ "\007\334\374\346\322\241\246\126\123\137\016\315\134\370\126\064"
+ "\230\014\107\063\172\371\246\262\211\016\364\215\064\340\046\040"
+ "\057\023\016\072\377\013\303\161\273\264\240\152\223\312\361\117"
+ "\321\253\232\217\142\054\367\156\210\066\215\203\031\346\377\371"
+ "\253\067\001\044\245\177\077\236\170\255\226\331\301\360\216\161"
+ "\343\325\267\015\025\165\322\134\113\304\030\047\372\064\176\367"
+ "\212\272\325\372\103\126\207\002\022\320\213\174\230\311\012\115"
+ "\006\037\204\336\337\116\163\345\143\005\365\012\147\343\213\040"
+ "\247\337\207\225\341\346\373\176\214\232\250\275\075\276\054\017"
+ "\270\112\220\121\205\061\111\125\256\140\212\163\033\317\234\322"
+ "\076\321\243\054\301\273\252\063\106\257\005\176\360\334\370\047"
+ "\055\134\056\243\356\004\105\331\014\150\304\366\350\143\142\020"
+ "\170\171\324\217\142\041\121\165\375\173\377\255\256\167\000\213"
+ "\144\006\122\032\322\355\166\020\361\065\073\336\306\145\004\271"
+ "\277\005\016\001\116\371\000\376\033\253\117\114\044\215\314\257"
+ ;
+
static void
text_init ()
{
diff --git a/tests/bench.h b/tests/bench.h
index 4bfdbd4ec1..caac5e5ab6 100644
--- a/tests/bench.h
+++ b/tests/bench.h
@@ -60,7 +60,7 @@ timing_end (struct timings_state *ts)
+ usage.ru_stime.tv_usec - ts->sys_start.tv_usec;
}
-static void
+_GL_UNUSED static void
timing_output (const struct timings_state *ts)
{
printf ("real %10.6f\n", (double)ts->real_usec / 1000000.0);
--
2.39.2
From 7b430a277a2443d968dac2735630ded561028987 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Sep 2023 14:51:57 -0700
Subject: [PATCH 4/7] mcel-prefer: new module
* modules/mcel-prefer: New file.
---
ChangeLog | 3 +++
modules/mcel-prefer | 28 ++++++++++++++++++++++++++++
2 files changed, 31 insertions(+)
create mode 100644 modules/mcel-prefer
diff --git a/ChangeLog b/ChangeLog
index cbb1979acb..5c967214ed 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,8 @@
2023-09-07 Paul Eggert <egg...@cs.ucla.edu>
+ mcel-prefer: new module
+ * modules/mcel-prefer: New file.
+
mcel-bench-tests: new module
* modules/mcel-bench-tests, tests/bench-mcel.c: New files.
* tests/bench-multibyte.h (TEXT_LATIN_ASCII_LINE1)
diff --git a/modules/mcel-prefer b/modules/mcel-prefer
new file mode 100644
index 0000000000..5c5ac24054
--- /dev/null
+++ b/modules/mcel-prefer
@@ -0,0 +1,28 @@
+Description:
+Prefer mcel to the mbiter family. mcel is simpler and can be faster.
+However, it does not support some obsolete encodings that are also not
+supported by glibc locales, and the caller is responsible for
+coalescing sequences of error-encoding bytes if that is desired.
+
+Files:
+
+Depends-on:
+mcel
+
+configure.ac-early:
+# Prefer mcel by default. This can be overridden via
+# './configure GNULIB_MCEL_PREFER=no'.
+: ${GNULIB_MCEL_PREFER=yes}
+
+configure.ac:
+gl_MODULE_INDICATOR([mcel-prefer])
+
+Makefile.am:
+
+Include:
+
+License:
+LGPLv2+
+
+Maintainer:
+Paul Eggert
--
2.39.2
From 7219d38b5716cd25af2eb177c03948b9908e09c6 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Sep 2023 14:51:58 -0700
Subject: [PATCH 5/7] exclude: support GNULIB_MCEL_PREFER
Support mcel API for apps that prefer it.
The following changes are in effect only if GNULIB_MCEL_PREFER.
* lib/exclude.c: Include mcel.h instead of mbuiter.h.
(string_hasher_ci): Use mcel_scanz instead of mbui_init,
mbui_avail, mbui_cur, and mbui_advance.
* modules/exclude: Do not depend on mbuiter.
---
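Note for readers without mcel.h at hand: the GNULIB_MCEL_PREFER loop in
string_hasher_ci can be approximated in standard C. The sketch below is
not the patch's code and does not use the mcel API; it assumes mbrtowc and
towlower as stand-ins for mcel_scanz and c32tolower, and the name
hash_ci_sketch is hypothetical. It only illustrates the idea of folding
case per character while treating each invalid byte as its own one-byte
encoding error.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, not part of Gnulib: case-insensitive hash of the
   NUL-terminated string S in the current locale.  Each byte that does
   not start a valid character is treated as a one-byte encoding error,
   mirroring the convention described for mcel in this series.  */
size_t
hash_ci_sketch (char const *s, size_t n_buckets)
{
  assert (n_buckets != 0);
  char const *lim = s + strlen (s);
  size_t value = 0;

  while (s < lim)
    {
      mbstate_t st;
      wchar_t wc;
      memset (&st, 0, sizeof st);   /* scan from the initial shift state */
      size_t len = mbrtowc (&wc, s, lim - s, &st);

      if (len == (size_t) -1 || len == (size_t) -2)
        {
          /* Invalid or incomplete sequence: consume exactly one byte
             and subtract it, as the patch does with g.err.  */
          value = value * 31 - (unsigned char) *s;
          len = 1;
        }
      else
        {
          value = value * 31 + towlower (wc);
          if (len == 0)
            len = 1;   /* defensive only; no NUL byte precedes LIM */
        }
      s += len;
    }
  return value % n_buckets;
}
```

With this sketch, strings that differ only in case land in the same
bucket, which is the property the exclude module relies on for its
case-insensitive hash table.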
ChangeLog | 8 ++++++++
lib/exclude.c | 16 +++++++++++++++-
modules/exclude | 2 +-
3 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index 5c967214ed..ba49a1177b 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,13 @@
2023-09-07 Paul Eggert <egg...@cs.ucla.edu>
+ exclude: support GNULIB_MCEL_PREFER
+ Support mcel API for apps that prefer it.
+ The following changes are in effect only if GNULIB_MCEL_PREFER.
+ * lib/exclude.c: Include mcel.h instead of mbuiter.h.
+ (string_hasher_ci): Use mcel_scanz instead of mbui_init,
+ mbui_avail, mbui_cur, and mbui_advance.
+ * modules/exclude: Do not depend on mbuiter.
+
mcel-prefer: new module
* modules/mcel-prefer: New file.
diff --git a/lib/exclude.c b/lib/exclude.c
index d1ecaedfc6..a3479db8a6 100644
--- a/lib/exclude.c
+++ b/lib/exclude.c
@@ -36,7 +36,11 @@
#include "filename.h"
#include "fnmatch.h"
#include "hash.h"
-#include "mbuiter.h"
+#if GNULIB_MCEL_PREFER
+# include "mcel.h"
+#else
+# include "mbuiter.h"
+#endif
#include "xalloc.h"
#if GNULIB_EXCLUDE_SINGLE_THREAD
@@ -204,7 +208,16 @@ string_hasher_ci (void const *data, size_t n_buckets)
char const *p = data;
size_t value = 0;
+#if GNULIB_MCEL_PREFER
+ while (*p)
+ {
+ mcel_t g = mcel_scanz (p);
+ value = value * 31 + (c32tolower (g.ch) - g.err);
+ p += g.len;
+ }
+#else
mbui_iterator_t iter;
+
for (mbui_init (iter, p); mbui_avail (iter); mbui_advance (iter))
{
mbchar_t m = mbui_cur (iter);
@@ -217,6 +230,7 @@ string_hasher_ci (void const *data, size_t n_buckets)
value = value * 31 + wc;
}
+#endif
return value % n_buckets;
}
diff --git a/modules/exclude b/modules/exclude
index 8adae5400f..92f8d3c472 100644
--- a/modules/exclude
+++ b/modules/exclude
@@ -13,7 +13,7 @@ fnmatch
fopen-gnu
hash
mbscasecmp
-mbuiter
+mbuiter [test "$GNULIB_MCEL_PREFER" != yes]
nullptr
regex
stdbool
--
2.39.2
From 6887c63276334e1ca7875c431eef503553527f17 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Sep 2023 14:51:58 -0700
Subject: [PATCH 6/7] mbscasecmp: support GNULIB_MCEL_PREFER
* lib/mbscasecmp.c: Include stdlib.h, since we use MB_CUR_MAX.
Include uchar.h, for c32tolower.
(GNULIB_MCEL_PREFER): Include mcel.h instead of mbuiterf.h.
(mbscasecmp) [GNULIB_MCEL_PREFER]: Use mcel instead of mbuiterf.
* modules/mbscasecmp (Depends-on): Add c32tolower, stdlib, uchar.
Depend on mbuiterf only if not preferring mcel.
---
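Note: the shape of the GNULIB_MCEL_PREFER loop in mbscasecmp (scan one
unit from each string, compare, stop on a difference or at the
terminator) can be tried out in standard C. This is a rough sketch under
stated assumptions, not the patch's implementation: struct unit,
scan_unit and casecmp_sketch are hypothetical names, mbrtowc and towlower
stand in for mcel_scanz and c32tolower, and the ordering rule (encoding
errors sort after characters) is a simplified reading of what mcel_tocmp
is described as doing.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for mcel_t: either a lowercased character
   (ERR == 0) or a single encoding-error byte (ERR != 0, CH == 0).
   The string terminator is reported as CH == 0 and ERR == 0.  */
struct unit { wint_t ch; unsigned char err; };

/* Scan one unit at *P, which must not exceed LIM, and advance *P.  */
static struct unit
scan_unit (char const **p, char const *lim)
{
  struct unit u = { 0, 0 };
  if (*p < lim)
    {
      mbstate_t st;
      wchar_t wc;
      memset (&st, 0, sizeof st);
      size_t len = mbrtowc (&wc, *p, lim - *p, &st);
      if (len == (size_t) -1 || len == (size_t) -2)
        {
          u.err = (unsigned char) **p;   /* one-byte encoding error */
          len = 1;
        }
      else
        {
          u.ch = towlower (wc);
          if (len == 0)
            len = 1;   /* defensive only; no NUL byte precedes LIM */
        }
      *p += len;
    }
  return u;
}

/* Compare S1 and S2 ignoring case; return <0, 0 or >0.  */
int
casecmp_sketch (char const *s1, char const *s2)
{
  char const *lim1 = s1 + strlen (s1);
  char const *lim2 = s2 + strlen (s2);
  for (;;)
    {
      struct unit u1 = scan_unit (&s1, lim1);
      struct unit u2 = scan_unit (&s2, lim2);
      if (u1.err != u2.err)   /* any error byte sorts after any character */
        return u1.err < u2.err ? -1 : 1;
      if (u1.ch != u2.ch)
        return u1.ch < u2.ch ? -1 : 1;
      if (u1.ch == 0 && u1.err == 0)
        return 0;             /* both strings ended together */
    }
}
```

Because the terminator scans as a unit with ch == 0, a string that is a
proper prefix of the other compares less without a separate length check,
which is the same trick that lets the patched loop get by with a single
exit test.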
ChangeLog | 8 ++++++++
lib/mbscasecmp.c | 19 ++++++++++++++++++-
modules/mbscasecmp | 5 ++++-
3 files changed, 30 insertions(+), 2 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index ba49a1177b..2aec64130e 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,13 @@
2023-09-07 Paul Eggert <egg...@cs.ucla.edu>
+ mbscasecmp: support GNULIB_MCEL_PREFER
+ * lib/mbscasecmp.c: Include stdlib.h, since we use MB_CUR_MAX.
+ Include uchar.h, for c32tolower.
+ (GNULIB_MCEL_PREFER): Include mcel.h instead of mbuiterf.h.
+ (mbscasecmp) [GNULIB_MCEL_PREFER]: Use mcel instead of mbuiterf.
+ * modules/mbscasecmp (Depends-on): Add c32tolower, stdlib, uchar.
+ Depend on mbuiterf only if not preferring mcel.
+
exclude: support GNULIB_MCEL_PREFER
Support mcel API for apps that prefer it.
The following changes are in effect only if GNULIB_MCEL_PREFER.
diff --git a/lib/mbscasecmp.c b/lib/mbscasecmp.c
index 80dc18529d..5e0bc67dc0 100644
--- a/lib/mbscasecmp.c
+++ b/lib/mbscasecmp.c
@@ -23,8 +23,14 @@
#include <ctype.h>
#include <limits.h>
+#include <stdlib.h>
+#include <uchar.h>
-#include "mbuiterf.h"
+#if GNULIB_MCEL_PREFER
+# include "mcel.h"
+#else
+# include "mbuiterf.h"
+#endif
/* Compare the character strings S1 and S2, ignoring case, returning less than,
equal to or greater than zero if S1 is lexicographically less than, equal to
@@ -45,6 +51,16 @@ mbscasecmp (const char *s1, const char *s2)
most often already in the very few first characters. */
if (MB_CUR_MAX > 1)
{
+#if GNULIB_MCEL_PREFER
+ while (true)
+ {
+ mcel_t g1 = mcel_scanz (iter1); iter1 += g1.len;
+ mcel_t g2 = mcel_scanz (iter2); iter2 += g2.len;
+ int cmp = mcel_tocmp (c32tolower, g1, g2);
+ if (cmp | !g1.ch)
+ return cmp;
+ }
+#else
mbuif_state_t state1;
mbuif_init (state1);
@@ -70,6 +86,7 @@ mbscasecmp (const char *s1, const char *s2)
/* s1 terminated before s2. */
return -1;
return 0;
+#endif
}
else
for (;;)
diff --git a/modules/mbscasecmp b/modules/mbscasecmp
index 234b7bc7a3..2fd8f1f4ce 100644
--- a/modules/mbscasecmp
+++ b/modules/mbscasecmp
@@ -5,8 +5,11 @@ Files:
lib/mbscasecmp.c
Depends-on:
-mbuiterf
+c32tolower
+mbuiterf [test "$GNULIB_MCEL_PREFER" != yes]
+stdlib
string
+uchar
configure.ac:
gl_STRING_MODULE_INDICATOR([mbscasecmp])
--
2.39.2
From 88998ee4c04e0458620bf25b9fb251568d9e2eaa Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Sep 2023 14:51:58 -0700
Subject: [PATCH 7/7] Update strings doc
* doc/strings.texi: Mention mbiterf, mbuiterf, mcel, and mcel-prefer.
---
doc/strings.texi | 88 +++++++++++++++++++++++++++++++++++++-----------
1 file changed, 69 insertions(+), 19 deletions(-)
diff --git a/doc/strings.texi b/doc/strings.texi
index 32165ac403..5c901493cc 100644
--- a/doc/strings.texi
+++ b/doc/strings.texi
@@ -26,6 +26,7 @@ in memory of a running C program.
@menu
* C strings::
+* Iterating through strings::
* Strings with NUL characters::
* String Functions in C Locale::
* Comparison of string APIs::
@@ -176,35 +177,84 @@ Gnulib has functions @func{mbscasecmp}, @func{mbsncasecmp},
function @code{ulc_casecmp} is preferable to these functions.
@end itemize
-Gnulib also has additional API.
+@cartouche
+@emph{A C string can contain encoding errors.}
+@end cartouche
-@menu
-* Iterating through strings::
-@end menu
+Not every NUL-terminated byte sequence represents a valid multibyte
+string. Byte sequences can contain encoding errors, that is, bytes or
+byte sequences that are invalid and do not represent characters.
+
+String functions like @code{mbscasecmp} and @code{strcoll} whose
+behavior depends on encoding have unspecified behavior on strings
+containing encoding errors, unless the behavior is specifically
+documented. If an application needs a particular behavior on these
+strings it can iterate through them itself, as described in the next
+subsection.
@node Iterating through strings
-@subsubsection Iterating through strings
+@subsection Iterating through strings
-For complex string processing, the provided string functions may not be
-enough, and what you need is a way to iterate through a string while
-processing each (possibly multibyte) character in turn. Gnulib provides
-two modules for this purpose. Both iterate through the string in
-forward direction. Iteration in backward direction, that is, from the
-string's end to start, is not provided, as it is too hairy in general.
+For complex string processing, string functions may not be enough, and
+you need to iterate through a string while processing each (possibly
+multibyte) character or encoding error in turn. Gnulib has several
+modules for iterating forward through a string in this way. Backward
+iteration, that is, from the string's end to start, is not provided,
+as it is too hairy in general.
@itemize
@item
-The @code{mbiter} module. It iterates through a C string whose length
-is already known.
+The @code{mbiter} module iterates through a string whose length
+is already known. The string can contain NULs and encoding errors.
+@item
+The @code{mbiterf} module is like @code{mbiter}
+except it is more complex and typically faster.
+@item
+The @code{mbuiter} module iterates through a C string whose length
+is not known a priori. The string can contain encoding errors and is
+terminated by the first NUL.
+@item
+The @code{mbuiterf} module is like @code{mbuiter}
+except it is more complex and typically faster.
+@item
+The @code{mcel} module is simpler than @code{mbiter} and @code{mbuiter}
+and can be faster than even @code{mbiterf} and @code{mbuiterf}.
+It can iterate through either strings whose length is known, or
+C strings, or strings terminated by other ASCII characters < 0x30.
@item
-The @code{mbuiter} module. It iterates through a C string whose length
-is not a-priori known.
+The @code{mcel-prefer} module is like @code{mcel} except that it
+causes some other modules to be based on @code{mcel} instead of
+on the @code{mbiter} family.
@end itemize
-The @code{mbuiter} module is suitable when there is a high probability
-that only the first few multibyte characters need to be inspected.
-Whereas the @code{mbiter} module is better if usually the iteration runs
-through the entire string.
+The choice of modules depends on the application's needs. The
+@code{mbiter} module family is more suitable for applications that
+treat some sequences of two or more bytes as a single encoding error,
+and for applications that need to support obsolescent encodings on
+non-GNU platforms, such as CP864, EBCDIC, Johab, and Shift JIS.
+In this module family, @code{mbuiter} and @code{mbuiterf} are more
+suitable than @code{mbiter} and @code{mbiterf} when arguments are C strings,
+lengths are not already known, and it is highly likely that only the
+first few multibyte characters need to be inspected.
+
+The @code{mcel} module is simpler and can be faster than the
+@code{mbiter} family, and is more suitable for applications that do
+not need the @code{mbiter} family's special features.
+
+The @code{mcel-prefer} module is like @code{mcel} except that it also
+causes some other modules, such as @code{mbscasecmp}, to use
+@code{mcel} rather than the @code{mbiter} family. This can be simpler
+and faster. However, it does not support the obsolescent encodings,
+and it may behave differently on data containing encoding errors where
+behavior is unspecified or undefined, because in @code{mcel} each
+encoding error is a single byte whereas in the @code{mbiter} family a
+single encoding error can contain two or more bytes.
+
+If a package uses @code{mcel-prefer}, it may also want to give
+@command{gnulib-tool} one or more of the options
+@option{--avoid=mbiter}, @option{--avoid=mbiterf},
+@option{--avoid=mbuiter} and @option{--avoid=mbuiterf},
+to avoid packaging modules that are not needed.
@node Strings with NUL characters
@subsection Strings with NUL characters
--
2.39.2