GNU diffutils has long lacked multi-byte locale support for options
like --ignore-case (-i) and --ignore-space-change (-b), and the recent
char32_t changes to diffutils/src/side.c inspired me to fix this. As
calling mbrtoc32 directly from diffutils was a pain, I looked into using
Gnulib's mbiter module to iterate through diffutils input. However,
mbiter's generality comes with a performance penalty.
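
For context on why direct mbrtoc32 calls are awkward, here is a minimal
sketch (not the diffutils code) of a loop that counts the characters in
a buffer. The name count_chars_raw is hypothetical, and counting each
undecodable byte as one unit is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: count the characters in BUF[0..N-1] by calling
   mbrtoc32 directly.  Each byte that cannot be decoded counts as one
   unit, which is an assumption made for this sketch.  */
static size_t
count_chars_raw (char const *buf, size_t n)
{
  assert (buf != NULL || n == 0);
  size_t count = 0;
  mbstate_t mbs;
  memset (&mbs, 0, sizeof mbs);

  for (size_t i = 0; i < n; count++)
    {
      char32_t ch;
      size_t len = mbrtoc32 (&ch, buf + i, n - i, &mbs);

      if (len == (size_t) -1 || len == (size_t) -2)
        {
          /* Encoding error or incomplete character: skip one byte and
             reset the conversion state before continuing.  */
          memset (&mbs, 0, sizeof mbs);
          len = 1;
        }
      else if (len == (size_t) -3)
        {
          /* A further character from earlier input; no bytes consumed.  */
          len = 0;
        }
      else if (len == 0)
        {
          /* A null character, assumed to occupy one byte.  */
          len = 1;
        }

      i += len;
    }

  return count;
}
```

Every caller has to repeat this case analysis on the return value and
carry the mbstate_t along, which is the bookkeeping an iterator is meant
to hide.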

Some of the performance penalty is due to Gnulib's mbrtoc32 module
replacing mbrtoc32 on glibc. As I understand it, the replacement exists
because glibc mishandles the C locale (it treats non-ASCII bytes as
encoding errors). That bug should not affect diffutils, as diffutils uses
mbrtoc32 only in multi-byte locales. So I'd like a way for diffutils to
use the mbrtoc32 module without replacing mbrtoc32 on glibc. In the
patch I just installed into diffutils on Savannah, this is done via a
conditional "#undef mbrtoc32" (see attached), but this is a hack and
there should be a better way.
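
The mechanism behind that hack can be sketched in isolation. As I
understand it, a Gnulib-style replacement works by defining a macro that
redirects the function name to an rpl_-prefixed substitute, so an #undef
makes later calls reach the underlying function again. In this sketch
decode_byte and rpl_decode_byte are hypothetical stand-ins for glibc's
mbrtoc32 and Gnulib's replacement; they are not real APIs.

```c
#include <assert.h>

/* Stand-in for the C library's function: it rejects bytes >= 0x80,
   mimicking the C-locale behavior described above.  */
static int
decode_byte (unsigned char b)
{
  return b < 0x80 ? b : -1;
}

/* Stand-in for a replacement that accepts every byte.  */
static int
rpl_decode_byte (unsigned char b)
{
  return b;
}

/* What a replacement header would do (hypothetical).  */
#define decode_byte rpl_decode_byte

static int
decode_with_replacement (unsigned char b)
{
  return decode_byte (b);   /* expands to rpl_decode_byte (b) */
}

/* The workaround: drop the macro so later calls use the original.  */
#ifdef decode_byte
# undef decode_byte
#endif

static int
decode_with_native (unsigned char b)
{
  return decode_byte (b);   /* calls the original decode_byte */
}

/* The two callers disagree only on non-ASCII bytes.  */
static void
demo (void)
{
  assert (decode_with_replacement (0xE9) == 0xE9);
  assert (decode_with_native (0xE9) == -1);
}
```

The drawback is visible even at this scale: the #undef silently affects
every later use of the name in the translation unit, which is why it is
a hack rather than an interface.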

More of the performance penalty appears to come from the mbiter module's
support for arbitrary character encodings that don't occur in practice,
or that are so rare that diffutils need not worry about them. To work
around this problem I wrote a simple, fast iterator "mbcel" that I hope
works on all the platforms Gnulib normally targets. mbcel uses a
functional style (that is, its only function, mbcel_scan, is pure in the
GCC sense, with no side effects), and this should help make calling code
clearer (and, I hope, more efficient).
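
To illustrate the calling style, here is a minimal, self-contained sketch
of a case-insensitive comparison in the spirit of diff -i. It is not the
code installed in diffutils: cel_t and cel_scan are simplified stand-ins
for mbcel_t and mbcel_scan so that the example compiles without
config.h, equal_ignoring_case is a hypothetical name, and passing
char32_t values to towlower assumes that wchar_t holds the same code
points, as on glibc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <uchar.h>
#include <wctype.h>

/* Simplified stand-in for mbcel_t.  */
typedef struct
{
  char32_t ch;        /* the character, if ERR is zero */
  unsigned char err;  /* the offending byte, or zero if none */
  unsigned char len;  /* number of bytes consumed */
} cel_t;

/* Simplified stand-in for mbcel_scan.  P must be less than LIM.  */
static cel_t
cel_scan (char const *p, char const *lim)
{
  assert (p < lim);
  unsigned char c = *p;
  if (c <= 0x7f)
    return (cel_t) { .ch = c, .len = 1 };

  char32_t ch;
  mbstate_t mbs = {0};
  size_t len = mbrtoc32 (&ch, p, lim - p, &mbs);
  if ((size_t) -1 / 2 < len)
    return (cel_t) { .err = c, .len = 1 };
  return (cel_t) { .ch = ch, .len = (unsigned char) len };
}

/* Return true if A[0..ALEN-1] and B[0..BLEN-1] are equal, ignoring case.
   Encoding-error bytes match only identical encoding-error bytes.  */
static bool
equal_ignoring_case (char const *a, size_t alen, char const *b, size_t blen)
{
  char const *alim = a + alen, *blim = b + blen;

  while (a < alim && b < blim)
    {
      cel_t ga = cel_scan (a, alim), gb = cel_scan (b, blim);
      if (ga.err != gb.err)
        return false;
      if (!ga.err && towlower ((wint_t) ga.ch) != towlower ((wint_t) gb.ch))
        return false;
      a += ga.len;
      b += gb.len;
    }

  /* Equal only if both inputs were consumed completely.  */
  return a == alim && b == blim;
}
```

Because each scan returns a value and keeps no state between calls, the
two buffers can be walked in lockstep without any per-buffer iterator
objects.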

I timed mbcel on the Emacs source code and it scanned the input
significantly faster than mbiter did. So I installed it into diffutils
on Savannah, as part of diffutils's new support for multi-byte locales.

I'm thinking that mbcel would be useful in Gnulib and in other GNU
programs, and that we should create an mbcel module for it in Gnulib. A
copy of its only file, lib/mbcel.h, is attached. The idea is to have an
option that is simple and fast, albeit not portable to theoretical
platforms.

/* Multi-byte characters, error encodings, and lengths
   Copyright 2023 Free Software Foundation, Inc.

   This file is free software: you can redistribute it and/or modify
   it under the terms of the GNU Lesser General Public License as
   published by the Free Software Foundation; either version 2.1 of the
   License, or (at your option) any later version.

   This file is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public License
   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */

/* Written by Paul Eggert.  */

/* The mbcel_scan function lets code iterate through an array of bytes,
   supporting character encodings in practical use
   more simply than using plain mbrtoc32.

   Instead of this single-byte code:

      char *p = ..., *lim = ...;
      for (; p < lim; p++)
        process (*p);

   You can use this multi-byte code:

      char *p = ..., *lim = ...;
      for (mbcel_t g; p < lim; p += g.len)
        {
          g = mbcel_scan (p, lim);
          process (g);
        }

   You can select from G using G.ch, G.err, and G.len.

   Although ISO C and POSIX allow encodings that have shift states or
   that can produce multiple characters from an indivisible byte sequence,
   POSIX does not require support for these encodings,
   they are not in practical use on GNUish platforms,
   and omitting support for them simplifies the API.  */

#ifndef _MBCEL_H
#define _MBCEL_H 1
/* This file uses _GL_INLINE_HEADER_BEGIN, _GL_INLINE. */
#if !_GL_CONFIG_H_INCLUDED
#error "Please include config.h first."
#endif
#include <uchar.h>

/* mbcel_t is a type representing a character CH or an encoding error byte ERR,
   along with a count of the LEN bytes that represent CH or ERR.
   If ERR is zero, CH is a valid character and 1 <= LEN <= MB_LEN_MAX;
   otherwise ERR is an encoding error byte, 0x80 <= ERR <= UCHAR_MAX,
   CH == 0, and LEN == 1.  */
typedef struct
{
  char32_t ch;
  unsigned char err;
  unsigned char len;
} mbcel_t;

/* Pacify GCC re '*p <= 0x7f' below. */
#if defined __GNUC__ && 4 < __GNUC__ + (3 <= __GNUC_MINOR__)
# pragma GCC diagnostic ignored "-Wtype-limits"
#endif
_GL_INLINE_HEADER_BEGIN
#ifndef MBCEL_INLINE
# define MBCEL_INLINE _GL_INLINE
#endif
/* With diffutils there is no need for the performance overhead of
replacing glibc mbrtoc32, as it doesn't matter whether the C locale
treats bytes with the high bit set as encoding errors. */
#ifdef __GLIBC__
# undef mbrtoc32
#endif

/* Scan bytes from P inclusive to LIM exclusive.  P must be less than LIM.
   Return either the representation of the valid character starting at P,
   or the representation of an encoding error of length 1 at P.  */
MBCEL_INLINE mbcel_t
mbcel_scan (char const *p, char const *lim)
{
  /* Handle ASCII quickly to avoid the overhead of calling mbrtoc32.
     In supported encodings, the first byte of a multi-byte character
     cannot be an ASCII byte.  */
  if (0 <= *p && *p <= 0x7f)
    return (mbcel_t) { .ch = *p, .len = 1 };

  char32_t ch;
  mbstate_t mbs = {0};
  size_t len = mbrtoc32 (&ch, p, lim - p, &mbs);

  /* Any LEN with top bit set is an encoding error, as LEN == (size_t) -3
     is not supported and MB_LEN_MAX <= (size_t) -1 / 2 on all platforms.  */
  if ((size_t) -1 / 2 < len)
    return (mbcel_t) { .err = *p, .len = 1 };

  /* A multi-byte character.  LEN must be positive,
     as *P != '\0' and shift sequences are not supported.  */
  return (mbcel_t) { .ch = ch, .len = len };
}

_GL_INLINE_HEADER_END
#endif /* _MBCEL_H */