On 2023-07-11 15:14, Bruno Haible wrote:
The performance problems that I see are:
- glibc's conversion functions are optimized for long sequences
(think of iconv()). They are not optimized for short invocations
(one multibyte character or less). This is a long-standing problem,
that no one is attacking.
Yes, but let's defer that for now. Attacking it would require rethinking
the mbrtowc/mbrtoc32 interface entirely, which is a bigger task.
- glibc's UTF-8 converter is very slow for texts with many non-ASCII
characters
Although there are ways to attack that, they're a bigger lift and for
now I'd like to defer this too.
- In the C locale (tests a, c, f), conversions of bytes < 0x80 are
cheap, whereas conversions of bytes >= 0x80 are expensive, because
in this code path, glibc returns (size_t)-1 and mbrtoc32.c invokes
hard_locale.
There are opportunities for performance improvement here in the mbcel
world. A new scanning primitive could accept an argument that is the
value of MB_LEN_MAX or that is true for the C locale, or whatever.
Callers could hoist that out of loops and this could improve speed further.
However, diff already hoists for other reasons and does not use mbcel in
single-byte locales, so this sort of thing wouldn't help diff. This,
plus the extra complexity the additional scanning primitive would add to
mbcel.h, is why I didn't implement such a primitive.
Can we optimize the need for calling hard_locale so often, somehow?
Or create a variant of mbrtoc32 that fetches the value of hard_locale
from some cache (maybe a __thread variable)?
Or can hard_locale itself be optimized (through dirty, glibc specific
hacks)?
All possible, I suppose, but as described above I wonder whether it
would be worth the effort in the mbcel world.
I do *not* see a performance problem with character encodings such as
ISO-8859-7 or GB18030 (tests g, j): the figures are comparable with UTF-8.
Yes, though mbcel should speed up both UTF-8 and the other encodings,
compared to mbiter.
How can this be? The Emacs source code is mostly ASCII, and the figures
above (test a, b) show that for this case, mbiter is well optimized.
Although mbiter is faster on ASCII than on non-ASCII, it's not
well-optimized compared to mbcel. On Fedora x86-64 here is the kernel of
an mbcel-based loop that merely scans ASCII and adds each byte's
numerical value to a sum:
.L28: addq %rax, %rbp
movl $1, %eax
addq %rax, %rbx
cmpq %r12, %rbx
jnb .L19
movsbq (%rbx), %rax
testb %al, %al
jns .L28
where %rbp = sum, %rbx = pointer to next byte, and %r12 = pointer just
past end of input.
In contrast, with mbiter the kernel is:
.L24: movq 136(%rsp), %rax
movq 112(%rsp), %r14
movq $1, 144(%rsp)
movsbl (%rbx), %edx
movb $1, 152(%rsp)
leaq 1(%rax), %rbx
movl %edx, 156(%rsp)
movq %rbx, 136(%rsp)
addq %rdx, %r15
movb $0, 128(%rsp)
cmpq %r14, %rbx
jnb .L5
.L14: movzbl (%rbx), %ecx
movl %ecx, %eax
shrb $5, %al
andl $7, %eax
movl is_basic_table(,%rax,4), %eax
shrl %cl, %eax
testb $1, %al
jne .L24
where %r15 = sum, %rbx = pointer to next byte, %r14 = pointer just past
end of input.
I'd better try to copy the worthy optimizations into mbiter, mbuiter.
I don't see how, offhand. Although I'm sure mbiter can be improved I
don't see how it could catch up to mbcel so long as it continues to
solve a harder problem than mbcel solves.
Candidates for optimization:
- The C locale handling
https://sourceware.org/bugzilla/show_bug.cgi?id=19932
https://sourceware.org/bugzilla/show_bug.cgi?id=29511
It's now a clear POSIX violation. Would it make sense to get this fixed
in glibc, so that gnulib's override can be dropped on future glibc
versions?
Absolutely. Does наб's patch in the latter bug report look good to you?
If so, I can propose it on libc-alpha after the next release (as it's
too late for the next release).
- Resetting an mbstate_t: Should we define a function
void mbszero (mbstate_t *);
that clears the relevant part of an mbstate_t (i.e. 24 bytes instead
of 128 bytes on BSD systems)?
Advantage: performance.
Drawback: Yet another gnulib-invented, nonstandard API.
It's likely worth it for mbcel on BSDish hosts. Quite possibly it's also
worth it for mbiter and mbuiter. Not sure it's worth it everywhere.
Also I still think it's at most 12 bytes, not 24, though I guess that
thread <https://lists.gnu.org/r/bug-gnulib/2023-07/msg00044.html> is
still ongoing.
- Is a functional interface faster than one that gets a 'struct' passed
by reference? I would guess no, since gcc optimizes both cases well,
especially when inlining. But feel free to prove me wrong.
In my benchmarks mbcel is significantly faster than mbiter on current
(commit 6a455aab6ef2c6efd6703cfa9319268b503a8e14) Gnulib benchmarks,
even when mbrtoc32-regular is used. You should be able to verify this by
running:
./gnulib-tool --create-testdir --dir b mbrtoc32-regular
mbiter-bench-tests
and then applying the attached patch, running "make check", and then
running these commands in the gltests subdirectory:
./bench-mbiter abcdefghij 100000
./bench-mbcel abcdefghij 100000
Here's a summary of the results I got on Fedora 38 x86-64 on an AMD
Phenom II X4 910e processor dated 2010.
user CPU sec speedup
mbiter mbcel factor test
1.735 0.478 3.630 a - ASCII text, C locale
1.703 0.447 3.810 b - ASCII text, UTF-8 locale
3.852 1.514 2.544 c - French text, C locale
3.544 1.600 2.215 d - French text, ISO-8859-1 locale
3.651 1.662 2.197 e - French text, UTF-8 locale
26.787 15.115 1.772 f - Greek text, C locale
21.651 17.106 1.266 g - Greek text, ISO-8859-7 locale
22.565 17.633 1.280 h - Greek text, UTF-8 locale
10.011 8.051 1.243 i - Chinese text, UTF-8 locale
9.787 7.967 1.228 j - Chinese text, GB18030 locale
With a better CPU (a Xeon W-1350 dated 2021) and a slightly-slower OS
(Ubuntu 23.04) I got these numbers:
user CPU sec speedup
mbiter mbcel factor test
0.531 0.238 2.231 a - ASCII text, C locale
0.478 0.187 2.556 b - ASCII text, UTF-8 locale
1.262 0.510 2.475 c - French text, C locale
1.121 0.529 2.119 d - French text, ISO-8859-1 locale
1.080 0.571 1.891 e - French text, UTF-8 locale
10.349 5.876 1.761 f - Greek text, C locale
8.530 6.537 1.305 g - Greek text, ISO-8859-7 locale
8.407 6.506 1.292 h - Greek text, UTF-8 locale
3.427 2.578 1.329 i - Chinese text, UTF-8 locale
3.279 2.489 1.317 j - Chinese text, GB18030 locale
diff -pruN b/gllib/mbcel.h c/gllib/mbcel.h
--- b/gllib/mbcel.h 1969-12-31 16:00:00.000000000 -0800
+++ c/gllib/mbcel.h 2023-07-12 09:16:09.705062277 -0700
@@ -0,0 +1,142 @@
+/* Multi-byte characters, error encodings, and lengths
+ Copyright 2023 Free Software Foundation, Inc.
+
+ This file is free software: you can redistribute it and/or modify
+ it under the terms of the GNU Lesser General Public License as
+ published by the Free Software Foundation; either version 2.1 of the
+ License, or (at your option) any later version.
+
+ This file is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public License
+ along with this program. If not, see <https://www.gnu.org/licenses/>. */
+
+/* Written by Paul Eggert. */
+
+/* The mbcel_scan function lets code iterate through an array of bytes,
+ supporting character encodings in practical use
+ more simply than using plain mbrtoc32.
+
+ Instead of this single-byte code:
+
+ char *p = ..., *lim = ...;
+ for (; p < lim; p++)
+ process (*p);
+
+ You can use this multi-byte code:
+
+ char *p = ..., *lim = ...;
+ for (mbcel_t g; p < lim; p += g.len)
+ {
+ g = mbcel_scan (p, lim);
+ process (g);
+ }
+
+ You can select from G using G.ch, G.err, and G.len.
+
+ Although ISO C and POSIX allow encodings that have shift states or
+ that can produce multiple characters from an indivisible byte sequence,
+ POSIX does not require support for these encodings,
+ they are not in practical use on GNUish platforms,
+ and omitting support for them simplifies the API. */
+
+#ifndef _MBCEL_H
+#define _MBCEL_H 1
+
+/* This file uses _GL_INLINE_HEADER_BEGIN, _GL_INLINE. */
+#if !_GL_CONFIG_H_INCLUDED
+ #error "Please include config.h first."
+#endif
+
+#include <limits.h>
+#include <stddef.h>
+#include <uchar.h>
+
+/* mbcel_t is a type representing a character CH or an encoding error byte ERR,
+ along with a count of the LEN bytes that represent CH or ERR.
+ If ERR is zero, CH is a valid character and 1 <= LEN <= MB_LEN_MAX;
+ otherwise ERR is an encoding error byte, 0x80 <= ERR <= UCHAR_MAX,
+ CH == 0, and LEN == 1. */
+typedef struct
+{
+ char32_t ch;
+ unsigned char err;
+ unsigned char len;
+} mbcel_t;
+
+/* On all known platforms, every multi-byte character length fits in
+ mbcel_t's LEN. Check this. */
+static_assert (MB_LEN_MAX <= UCHAR_MAX);
+
+/* Pacify GCC re '*p <= 0x7f' below. */
+#if defined __GNUC__ && 4 < __GNUC__ + (3 <= __GNUC_MINOR__)
+# pragma GCC diagnostic ignored "-Wtype-limits"
+#endif
+
+_GL_INLINE_HEADER_BEGIN
+#ifndef MBCEL_INLINE
+# define MBCEL_INLINE _GL_INLINE
+#endif
+
+/* With diffutils there is no need for the performance overhead of
+ replacing glibc mbrtoc32, as it doesn't matter whether the C locale
+ treats bytes with the high bit set as encoding errors. */
+#ifdef __GLIBC__
+# undef mbrtoc32
+#endif
+
+/* Scan bytes from P inclusive to LIM exclusive. P must be less than LIM.
+ Return either the representation of the valid character starting at P,
+ or the representation of an encoding error of length 1 at P. */
+MBCEL_INLINE mbcel_t
+mbcel_scan (char const *p, char const *lim)
+{
+ /* Handle ASCII quickly to avoid the overhead of calling mbrtoc32.
+ In supported encodings, the first byte of a multi-byte character
+ cannot be an ASCII byte. */
+ if (0 <= *p && *p <= 0x7f)
+ return (mbcel_t) { .ch = *p, .len = 1 };
+
+ /* An initial mbstate_t; initialization optimized for some platforms. */
+#if defined __GLIBC__ && 2 < __GLIBC__ + (2 <= __GLIBC_MINOR__)
+ mbstate_t mbs; mbs.__count = 0;
+#elif (defined __FreeBSD__ || defined __DragonFly__ || defined __OpenBSD__ \
+ || (defined __APPLE__ && defined __MACH__))
+ /* Initialize for all encodings: UTF-8, EUC, etc. */
+ union { mbstate_t m; struct { uchar_t ch; int utf8_want, euc_want; } s; } u;
+ u.s.ch = u.s.utf8_want = u.s.euc_want = 0;
+# define mbs u.m
+#elif defined __NetBSD__
+ union { mbstate_t m; struct _RuneLocale *s; } u;
+ u.s = nullptr;
+# define mbs u.m
+#else
+ /* mbstate_t has unknown structure or is not worth optimizing. */
+ mbstate_t mbs = {0};
+#endif
+
+ char32_t ch;
+ size_t len = mbrtoc32 (&ch, p, lim - p, &mbs);
+
+ /* Any LEN with top bit set is an encoding error, as LEN == (size_t) -3
+ is not supported and MB_LEN_MAX <= (size_t) -1 / 2 on all platforms. */
+ static_assert (MB_LEN_MAX <= (size_t) -1 / 2);
+ if ((size_t) -1 / 2 < len)
+ return (mbcel_t) { .err = *p, .len = 1 };
+
+ /* Tell the compiler LEN is at most MB_LEN_MAX,
+ as this can help GCC generate better code. */
+ if (! (len <= MB_LEN_MAX))
+ unreachable ();
+
+ /* A multi-byte character. LEN must be positive,
+ as *P != '\0' and shift sequences are not supported. */
+ return (mbcel_t) { .ch = ch, .len = len };
+}
+
+_GL_INLINE_HEADER_END
+
+#endif /* _MBCEL_H */
diff -pruN b/gltests/bench-mbcel.c c/gltests/bench-mbcel.c
--- b/gltests/bench-mbcel.c 1969-12-31 16:00:00.000000000 -0800
+++ c/gltests/bench-mbcel.c 2023-07-12 09:19:43.667072433 -0700
@@ -0,0 +1,136 @@
+/* Benchmarks for the mbiter module.
+ Copyright (C) 2023 Free Software Foundation, Inc.
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <https://www.gnu.org/licenses/>. */
+
+#include <config.h>
+
+#include <stdlib.h>
+#include <string.h>
+#include <locale.h>
+#include <uchar.h>
+
+#include "bench.h"
+#include "bench-multibyte.h"
+#include "mbcel.h"
+
+unsigned long long volatile result;
+
+static void
+do_test (char test, int repeat, const char *locale_name, const char *text)
+{
+ printf ("Test %c\n", test);
+ if (setlocale (LC_ALL, locale_name) != NULL)
+ {
+ struct timings_state ts;
+ timing_start (&ts);
+
+ size_t text_len = strlen (text);
+ int count;
+ for (count = 0; count < repeat; count++)
+ {
+ unsigned long long sum = 0;
+ char const *p = text;
+ char const *lim = text + text_len;
+ for (mbcel_t g; p < lim; p += g.len)
+ {
+ g = mbcel_scan (p, lim);
+ sum += g.ch;
+ }
+ result = sum;
+ }
+
+ timing_end (&ts);
+ timing_output (&ts);
+ }
+ else
+ {
+ printf ("Skipping test: locale %s not installed.\n", locale_name);
+ }
+ printf ("\n");
+}
+
+/* Performs some or all of the following tests:
+ a - ASCII text, C locale
+ b - ASCII text, UTF-8 locale
+ c - French text, C locale
+ d - French text, ISO-8859-1 locale
+ e - French text, UTF-8 locale
+ f - Greek text, C locale
+ g - Greek text, ISO-8859-7 locale
+ h - Greek text, UTF-8 locale
+ i - Chinese text, UTF-8 locale
+ j - Chinese text, GB18030 locale
+ Pass the tests to be performed as first argument. */
+int
+main (int argc, char *argv[])
+{
+ if (argc != 3)
+ {
+ fprintf (stderr, "Usage: %s TESTS REPETITIONS\n", argv[0]);
+ fprintf (stderr, "Example: %s abcdefghij 100000\n", argv[0]);
+ exit (1);
+ }
+
+ const char *tests = argv[1];
+ int repeat = atoi (argv[2]);
+
+ text_init ();
+
+ /* Execute each test. */
+ size_t i;
+ for (i = 0; i < strlen (tests); i++)
+ {
+ char test = tests[i];
+
+ switch (test)
+ {
+ case 'a':
+ do_test (test, repeat, "C", text_latin_ascii);
+ break;
+ case 'b':
+ do_test (test, repeat, "en_US.UTF-8", text_latin_ascii);
+ break;
+ case 'c':
+ do_test (test, repeat, "C", text_french_iso8859);
+ break;
+ case 'd':
+ do_test (test, repeat, "fr_FR.ISO-8859-1", text_french_iso8859);
+ break;
+ case 'e':
+ do_test (test, repeat, "en_US.UTF-8", text_french_utf8);
+ break;
+ case 'f':
+ do_test (test, repeat, "C", text_greek_iso8859);
+ break;
+ case 'g':
+ do_test (test, repeat, "el_GR.ISO-8859-7", text_greek_iso8859);
+ break;
+ case 'h':
+ do_test (test, repeat, "en_US.UTF-8", text_greek_utf8);
+ break;
+ case 'i':
+ do_test (test, repeat, "en_US.UTF-8", text_chinese_utf8);
+ break;
+ case 'j':
+ do_test (test, repeat, "zh_CN.GB18030", text_chinese_gb18030);
+ break;
+ default:
+ /* Ignore. */
+ ;
+ }
+ }
+
+ return 0;
+}
diff -pruN b/gltests/Makefile.am c/gltests/Makefile.am
--- b/gltests/Makefile.am 2023-07-12 09:13:27.458521888 -0700
+++ c/gltests/Makefile.am 2023-07-12 09:16:32.556852864 -0700
@@ -780,6 +780,10 @@ EXTRA_DIST += bench-mbiter.c bench-multi
## end gnulib module mbiter-bench-tests
+noinst_PROGRAMS += bench-mbcel
+bench_mbcel_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV)
+EXTRA_DIST += bench-mbcel.c bench-multibyte.h bench.h
+
## begin gnulib module mbrtoc32-regular-tests
TESTS += test-mbrtoc32-regular