On 2023-07-11 15:14, Bruno Haible wrote:

The performance problems that I see are:

   - glibc's conversion functions are optimized for long sequences
     (think of iconv()). They are not optimized for short invocations
     (one multibyte character or less). This is a long-standing problem,
     that no one is attacking.

Yes, but let's defer that for now. Attacking it would require rethinking the mbrtowc/mbrtoc32 interface entirely, which is a bigger task.


   - glibc's UTF-8 converter is very slow for texts with many non-ASCII
     characters

Although there are ways to attack that, they're a bigger lift and for now I'd like to defer this too.


   - In the C locale (tests a, c, f), conversions of bytes < 0x80 are
     cheap, whereas conversions of bytes >= 0x80 are expensive, because
     in this code path, glibc returns (size_t)-1 and mbrtoc32.c invokes
     hard_locale.

There are opportunities for performance improvement here in the mbcel world. A new scanning primitive could accept an argument that is the value of MB_LEN_MAX or that is true for the C locale, or whatever. Callers could hoist that out of loops and this could improve speed further.

However, diff already hoists for other reasons and does not use mbcel in single-byte locales, so this sort of thing wouldn't help diff. This, plus the extra complexity the additional scanning primitive would add to mbcel.h, is why I didn't implement such a primitive.


     Can we optimize the need for calling hard_locale so often, somehow?
     Or create a variant of mbrtoc32 that fetches the value of hard_locale
     from some cache (maybe a __thread variable)?
     Or can hard_locale itself be optimized (through dirty, glibc specific
     hacks)?

All possible, I suppose, but as described above I wonder whether it would be worth the effort in the mbcel world.


I do *not* see a performance problem with character encodings such as
ISO-8859-7 or GB18030 (tests g, j): the figures are comparable with UTF-8.

Yes, though mbcel should speed up both UTF-8 and the other encodings, compared to mbiter.


How can this be? The Emacs source code is mostly ASCII, and the figures
above (test a, b) show that for this case, mbiter is well optimized.

Although mbiter is faster on ASCII than on non-ASCII, it's not well-optimized compared to mbcel. On Fedora x86-64 here is the kernel of an mbcel-based loop that merely scans ASCII and adds each byte's numerical value to a sum:

 .L28:  addq    %rax, %rbp
        movl    $1, %eax
        addq    %rax, %rbx
        cmpq    %r12, %rbx
        jnb     .L19
        movsbq  (%rbx), %rax
        testb   %al, %al
        jns     .L28

where %rbp = sum, %rbx = pointer to next byte, and %r12 = pointer just past end of input.

In contrast, with mbiter the kernel is:

 .L24:  movq    136(%rsp), %rax
        movq    112(%rsp), %r14
        movq    $1, 144(%rsp)
        movsbl  (%rbx), %edx
        movb    $1, 152(%rsp)
        leaq    1(%rax), %rbx
        movl    %edx, 156(%rsp)
        movq    %rbx, 136(%rsp)
        addq    %rdx, %r15
        movb    $0, 128(%rsp)
        cmpq    %r14, %rbx
        jnb     .L5
 .L14:  movzbl  (%rbx), %ecx
        movl    %ecx, %eax
        shrb    $5, %al
        andl    $7, %eax
        movl    is_basic_table(,%rax,4), %eax
        shrl    %cl, %eax
        testb   $1, %al
        jne     .L24

where %r15 = sum, %rbx = pointer to next byte, %r14 = pointer just past end of input.


I'd better try to copy the worthy optimizations into mbiter, mbuiter.

I don't see how, offhand. Although I'm sure mbiter can be improved I don't see how it could catch up to mbcel so long as it continues to solve a harder problem than mbcel solves.

Candidates for optimization:

- The C locale handling
   https://sourceware.org/bugzilla/show_bug.cgi?id=19932
   https://sourceware.org/bugzilla/show_bug.cgi?id=29511
   It's now a clear POSIX violation. Would it make sense to get this fixed
   in glibc, so that gnulib's override can be dropped on future glibc
   versions?

Absolutely. Does наб's patch in the latter bug report look good to you? If so, I can propose it on libc-alpha after the next release (as it's too late for the next release).


- Resetting an mbstate_t: Should we define a function
      void mbszero (mbstate_t *);
   that clears the relevant part of an mbstate_t (i.e. 24 bytes instead
   of 128 bytes on BSD systems)?
   Advantage: performance.
   Drawback: Yet another gnulib-invented, nonstandard API.

It's likely worth it for mbcel on BSDish hosts. Quite possibly it's also worth it for mbiter and mbuiter. Not sure it's worth it everywhere.

Also I still think it's at most 12 bytes, not 24, though I guess that thread <https://lists.gnu.org/r/bug-gnulib/2023-07/msg00044.html> is still ongoing.


- Is a functional interface faster than one that gets a 'struct' passed
  by reference? I would guess no, since gcc optimizes both cases well,
  especially when inlining. But feel free to prove me wrong.

In my benchmarks mbcel is significantly faster than mbiter on current (commit 6a455aab6ef2c6efd6703cfa9319268b503a8e14) Gnulib benchmarks, even when mbrtoc32-regular is used. You should be able to verify this by running:

./gnulib-tool --create-testdir --dir b mbrtoc32-regular mbiter-bench-tests

and then applying the attached patch, running "make check", and then running these commands in the gltests subdirectory:

  ./bench-mbiter abcdefghij 100000
  ./bench-mbcel  abcdefghij 100000

Here's a summary of the results I got on Fedora 38 x86-64 on an AMD Phenom II X4 910e processor dated 2010.

 user CPU sec   speedup
 mbiter  mbcel  factor  test
  1.735  0.478  3.630   a - ASCII text, C locale
  1.703  0.447  3.810   b - ASCII text, UTF-8 locale
  3.852  1.514  2.544   c - French text, C locale
  3.544  1.600  2.215   d - French text, ISO-8859-1 locale
  3.651  1.662  2.197   e - French text, UTF-8 locale
 26.787 15.115  1.772   f - Greek text, C locale
 21.651 17.106  1.266   g - Greek text, ISO-8859-7 locale
 22.565 17.633  1.280   h - Greek text, UTF-8 locale
 10.011  8.051  1.243   i - Chinese text, UTF-8 locale
  9.787  7.967  1.228   j - Chinese text, GB18030 locale

With a better CPU (a Xeon W-1350 dated 2021) and a slightly-slower OS (Ubuntu 23.04) I got these numbers:

 user CPU sec   speedup
 mbiter  mbcel  factor  test
  0.531  0.238  2.231   a - ASCII text, C locale
  0.478  0.187  2.556   b - ASCII text, UTF-8 locale
  1.262  0.510  2.475   c - French text, C locale
  1.121  0.529  2.119   d - French text, ISO-8859-1 locale
  1.080  0.571  1.891   e - French text, UTF-8 locale
 10.349  5.876  1.761   f - Greek text, C locale
  8.530  6.537  1.305   g - Greek text, ISO-8859-7 locale
  8.407  6.506  1.292   h - Greek text, UTF-8 locale
  3.427  2.578  1.329   i - Chinese text, UTF-8 locale
  3.279  2.489  1.317   j - Chinese text, GB18030 locale
diff -pruN b/gllib/mbcel.h c/gllib/mbcel.h
--- b/gllib/mbcel.h	1969-12-31 16:00:00.000000000 -0800
+++ c/gllib/mbcel.h	2023-07-12 09:16:09.705062277 -0700
@@ -0,0 +1,142 @@
+/* Multi-byte characters, error encodings, and lengths
+   Copyright 2023 Free Software Foundation, Inc.
+
+   This file is free software: you can redistribute it and/or modify
+   it under the terms of the GNU Lesser General Public License as
+   published by the Free Software Foundation; either version 2.1 of the
+   License, or (at your option) any later version.
+
+   This file is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Paul Eggert.  */
+
+/* The mbcel_scan function lets code iterate through an array of bytes,
+   supporting character encodings in practical use
+   more simply than using plain mbrtoc32.
+
+   Instead of this single-byte code:
+
+      char *p = ..., *lim = ...;
+      for (; p < lim; p++)
+        process (*p);
+
+   You can use this multi-byte code:
+
+      char *p = ..., *lim = ...;
+      for (mbcel_t g; p < lim; p += g.len)
+        {
+	  g = mbcel_scan (p, lim);
+	  process (g);
+	}
+
+   You can select from G using G.ch, G.err, and G.len.
+
+   Although ISO C and POSIX allow encodings that have shift states or
+   that can produce multiple characters from an indivisible byte sequence,
+   POSIX does not require support for these encodings,
+   they are not in practical use on GNUish platforms,
+   and omitting support for them simplifies the API.  */
+
+#ifndef _MBCEL_H
+#define _MBCEL_H 1
+
+/* This file uses _GL_INLINE_HEADER_BEGIN, _GL_INLINE.  */
+#if !_GL_CONFIG_H_INCLUDED
+ #error "Please include config.h first."
+#endif
+
+#include <limits.h>
+#include <stddef.h>
+#include <uchar.h>
+
+/* mbcel_t is a type representing a character CH or an encoding error byte ERR,
+   along with a count of the LEN bytes that represent CH or ERR.
+   If ERR is zero, CH is a valid character and 1 <= LEN <= MB_LEN_MAX;
+   otherwise ERR is an encoding error byte, 0x80 <= ERR <= UCHAR_MAX,
+   CH == 0, and LEN == 1.  */
+typedef struct
+{
+  char32_t ch;
+  unsigned char err;
+  unsigned char len;
+} mbcel_t;
+
+/* On all known platforms, every multi-byte character length fits in
+   mbcel_t's LEN.  Check this.  */
+static_assert (MB_LEN_MAX <= UCHAR_MAX);
+
+/* Pacify GCC re '*p <= 0x7f' below.  */
+#if defined __GNUC__ && 4 < __GNUC__ + (3 <= __GNUC_MINOR__)
+# pragma GCC diagnostic ignored "-Wtype-limits"
+#endif
+
+_GL_INLINE_HEADER_BEGIN
+#ifndef MBCEL_INLINE
+# define MBCEL_INLINE _GL_INLINE
+#endif
+
+/* With diffutils there is no need for the performance overhead of
+   replacing glibc mbrtoc32, as it doesn't matter whether the C locale
+   treats bytes with the high bit set as encoding errors.  */
+#ifdef __GLIBC__
+# undef mbrtoc32
+#endif
+
+/* Scan bytes from P inclusive to LIM exclusive.  P must be less than LIM.
+   Return either the representation of the valid character starting at P,
+   or the representation of an encoding error of length 1 at P.  */
+MBCEL_INLINE mbcel_t
+mbcel_scan (char const *p, char const *lim)
+{
+  /* Handle ASCII quickly to avoid the overhead of calling mbrtoc32.
+     In supported encodings, the first byte of a multi-byte character
+     cannot be an ASCII byte.  */
+  if (0 <= *p && *p <= 0x7f)
+    return (mbcel_t) { .ch = *p, .len = 1 };
+
+  /* An initial mbstate_t; initialization optimized for some platforms.  */
+#if defined __GLIBC__ && 2 < __GLIBC__ + (2 <= __GLIBC_MINOR__)
+  mbstate_t mbs; mbs.__count = 0;
+#elif (defined __FreeBSD__ || defined __DragonFly__ || defined __OpenBSD__ \
+       || (defined __APPLE__ && defined __MACH__))
+  /* Initialize for all encodings: UTF-8, EUC, etc.  */
+  union { mbstate_t m; struct { uchar_t ch; int utf8_want, euc_want; } s; } u;
+  u.s.ch = u.s.utf8_want = u.s.euc_want = 0;
+# define mbs u.m
+#elif defined __NetBSD__
+  union { mbstate_t m; struct _RuneLocale *s; } u;
+  u.s = nullptr;
+# define mbs u.m
+#else
+  /* mbstate_t has unknown structure or is not worth optimizing.  */
+  mbstate_t mbs = {0};
+#endif
+
+  char32_t ch;
+  size_t len = mbrtoc32 (&ch, p, lim - p, &mbs);
+
+  /* Any LEN with top bit set is an encoding error, as LEN == (size_t) -3
+     is not supported and MB_LEN_MAX <= (size_t) -1 / 2 on all platforms.  */
+  static_assert (MB_LEN_MAX <= (size_t) -1 / 2);
+  if ((size_t) -1 / 2 < len)
+    return (mbcel_t) { .err = *p, .len = 1 };
+
+  /* Tell the compiler LEN is at most MB_LEN_MAX,
+     as this can help GCC generate better code.  */
+  if (! (len <= MB_LEN_MAX))
+    unreachable ();
+
+  /* A multi-byte character.  LEN must be positive,
+     as *P != '\0' and shift sequences are not supported.  */
+  return (mbcel_t) { .ch = ch, .len = len };
+}
+
+_GL_INLINE_HEADER_END
+
+#endif /* _MBCEL_H */
diff -pruN b/gltests/bench-mbcel.c c/gltests/bench-mbcel.c
--- b/gltests/bench-mbcel.c	1969-12-31 16:00:00.000000000 -0800
+++ c/gltests/bench-mbcel.c	2023-07-12 09:19:43.667072433 -0700
@@ -0,0 +1,136 @@
+/* Benchmarks for the mbiter module.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+#include <config.h>
+
+#include <stdlib.h>
+#include <string.h>
+#include <locale.h>
+#include <uchar.h>
+
+#include "bench.h"
+#include "bench-multibyte.h"
+#include "mbcel.h"
+
+unsigned long long volatile result;
+
+static void
+do_test (char test, int repeat, const char *locale_name, const char *text)
+{
+  printf ("Test %c\n", test);
+  if (setlocale (LC_ALL, locale_name) != NULL)
+    {
+      struct timings_state ts;
+      timing_start (&ts);
+
+      size_t text_len = strlen (text);
+      int count;
+      for (count = 0; count < repeat; count++)
+        {
+          unsigned long long sum = 0;
+          char const *p = text;
+          char const *lim = text + text_len;
+          for (mbcel_t g; p < lim; p += g.len)
+            {
+              g = mbcel_scan (p, lim);
+              sum += g.ch;
+            }
+          result = sum;
+        }
+
+      timing_end (&ts);
+      timing_output (&ts);
+    }
+  else
+    {
+      printf ("Skipping test: locale %s not installed.\n", locale_name);
+    }
+  printf ("\n");
+}
+
+/* Performs some or all of the following tests:
+     a - ASCII text, C locale
+     b - ASCII text, UTF-8 locale
+     c - French text, C locale
+     d - French text, ISO-8859-1 locale
+     e - French text, UTF-8 locale
+     f - Greek text, C locale
+     g - Greek text, ISO-8859-7 locale
+     h - Greek text, UTF-8 locale
+     i - Chinese text, UTF-8 locale
+     j - Chinese text, GB18030 locale
+   Pass the tests to be performed as first argument.  */
+int
+main (int argc, char *argv[])
+{
+  if (argc != 3)
+    {
+      fprintf (stderr, "Usage: %s TESTS REPETITIONS\n", argv[0]);
+      fprintf (stderr, "Example: %s abcdefghij 100000\n", argv[0]);
+      exit (1);
+    }
+
+  const char *tests = argv[1];
+  int repeat = atoi (argv[2]);
+
+  text_init ();
+
+  /* Execute each test.  */
+  size_t i;
+  for (i = 0; i < strlen (tests); i++)
+    {
+      char test = tests[i];
+
+      switch (test)
+        {
+        case 'a':
+          do_test (test, repeat, "C", text_latin_ascii);
+          break;
+        case 'b':
+          do_test (test, repeat, "en_US.UTF-8", text_latin_ascii);
+          break;
+        case 'c':
+          do_test (test, repeat, "C", text_french_iso8859);
+          break;
+        case 'd':
+          do_test (test, repeat, "fr_FR.ISO-8859-1", text_french_iso8859);
+          break;
+        case 'e':
+          do_test (test, repeat, "en_US.UTF-8", text_french_utf8);
+          break;
+        case 'f':
+          do_test (test, repeat, "C", text_greek_iso8859);
+          break;
+        case 'g':
+          do_test (test, repeat, "el_GR.ISO-8859-7", text_greek_iso8859);
+          break;
+        case 'h':
+          do_test (test, repeat, "en_US.UTF-8", text_greek_utf8);
+          break;
+        case 'i':
+          do_test (test, repeat, "en_US.UTF-8", text_chinese_utf8);
+          break;
+        case 'j':
+          do_test (test, repeat, "zh_CN.GB18030", text_chinese_gb18030);
+          break;
+        default:
+          /* Ignore.  */
+          ;
+        }
+    }
+
+  return 0;
+}
diff -pruN b/gltests/Makefile.am c/gltests/Makefile.am
--- b/gltests/Makefile.am	2023-07-12 09:13:27.458521888 -0700
+++ c/gltests/Makefile.am	2023-07-12 09:16:32.556852864 -0700
@@ -780,6 +780,10 @@ EXTRA_DIST += bench-mbiter.c bench-multi
 
 ## end   gnulib module mbiter-bench-tests
 
+noinst_PROGRAMS += bench-mbcel
+bench_mbcel_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV)
+EXTRA_DIST += bench-mbcel.c bench-multibyte.h bench.h
+
 ## begin gnulib module mbrtoc32-regular-tests
 
 TESTS += test-mbrtoc32-regular

Reply via email to