GNU diffutils has long lacked support for multi-byte locales for options like --ignore-case (-i) and --ignore-space-change (-b), and the recent char32_t changes to diffutils/src/side.c inspired me to fix this. As it was a pain for diffutils to use mbrtoc32 directly, I looked into using Gnulib's mbiter module to iterate through diffutils input. However, mbiter's generality had a performance penalty.

Some of the performance penalty is due to Gnulib's mbrtoc32 module replacing mbrtoc32 on glibc. As I understand it, the replacement exists because of glibc's mishandling of the C locale (it treats non-ASCII bytes as encoding errors). Such a bug should not affect diffutils, as diffutils uses mbrtoc32 only in multi-byte locales. So I'd like a way for diffutils to use the mbrtoc32 module without replacing mbrtoc32 on glibc. In the patch I just installed into diffutils on Savannah, this is done via a conditional "#undef mbrtoc32" (see attached), but this is a hack and there should be a better way.
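
To illustrate "only in multi-byte locales": a minimal sketch of one way a caller might decide whether the multi-byte path is needed at all. The helper name locale_is_multibyte is hypothetical, and using MB_CUR_MAX for the check is an assumption on my part, not necessarily how diffutils makes the decision.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: return true if the current locale can encode
   a character in more than one byte.  MB_CUR_MAX is standard C and
   reflects the LC_CTYPE category of the current locale.  */
static bool
locale_is_multibyte (void)
{
  return 1 < MB_CUR_MAX;
}
```

A caller could test this once after setlocale and use plain byte-at-a-time code when it returns false, so that mbrtoc32 is never reached in a single-byte locale.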

More of the performance penalty appears to come from the mbiter module's support for arbitrary character encodings that don't occur in practice, or that are at least so rare that diffutils need not worry about them. To work around this problem I wrote a simple, fast iterator "mbcel" that I hope works on all the platforms Gnulib normally targets. mbcel uses a functional style (that is, its only function, mbcel_scan, is pure in the GCC sense, with no side effects), and this should help make calling code clearer (and, I hope, more efficient).
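
To give a feel for the calling style, here is a rough, self-contained sketch that counts characters and encoding-error bytes in a buffer. Because the real mbcel.h needs Gnulib's config.h, the sketch uses a stand-in: cell_t, cell_scan and count_cells are hypothetical names that only mirror the shape of mbcel_t and mbcel_scan, built on the standard C11 mbrtoc32. The extra "len == 0" check is a defensive addition in the sketch, not something taken from mbcel.h.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical stand-in for mbcel_t: a character CH or an
   encoding-error byte ERR, plus the number LEN of bytes consumed.  */
typedef struct
{
  char32_t ch;
  unsigned char err;
  unsigned char len;
} cell_t;

/* Hypothetical stand-in for mbcel_scan.  P must be less than LIM.
   The result depends only on the bytes and the current locale.  */
static cell_t
cell_scan (char const *p, char const *lim)
{
  unsigned char b = *p;

  /* ASCII fast path: no call to mbrtoc32 needed.  */
  if (b <= 0x7f)
    return (cell_t) { .ch = b, .len = 1 };

  char32_t ch;
  mbstate_t mbs;
  memset (&mbs, 0, sizeof mbs);   /* Start in the initial state.  */
  size_t len = mbrtoc32 (&ch, p, lim - p, &mbs);

  /* (size_t) -1, -2 and -3 all have the top bit set; treat them,
     and (defensively) a zero length, as a one-byte encoding error.  */
  if ((size_t) -1 / 2 < len || len == 0)
    return (cell_t) { .err = b, .len = 1 };

  return (cell_t) { .ch = ch, .len = len };
}

/* Return the number of cells (characters plus error bytes) in the
   N bytes at BUF, storing the number of error bytes into *NERRS.  */
static size_t
count_cells (char const *buf, size_t n, size_t *nerrs)
{
  size_t cells = 0;
  char const *lim = buf + n;
  *nerrs = 0;
  for (char const *p = buf; p < lim; )
    {
      cell_t g = cell_scan (p, lim);
      cells++;
      *nerrs += g.err != 0;
      p += g.len;
    }
  return cells;
}
```

The loop keeps no conversion state between iterations, so it can restart scanning at any character boundary; that is the property the functional style is meant to buy.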

I timed mbcel on the Emacs source code and it scanned the input significantly faster than mbiter did. So I installed it into diffutils on Savannah, as part of diffutils's new support for multi-byte locales.

I'm thinking that mbcel would be useful in Gnulib and in other GNU programs, and that we should create a mbcel module for it in Gnulib. A copy of its only file lib/mbcel.h is attached. The idea is to have an option that is simple and fast, albeit not portable to theoretical platforms.

/* Multi-byte characters, encoding errors, and lengths
   Copyright 2023 Free Software Foundation, Inc.

   This file is free software: you can redistribute it and/or modify
   it under the terms of the GNU Lesser General Public License as
   published by the Free Software Foundation; either version 2.1 of the
   License, or (at your option) any later version.

   This file is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public License
   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */

/* Written by Paul Eggert.  */

/* The mbcel_scan function lets code iterate through an array of bytes,
   supporting character encodings in practical use
   more simply than using plain mbrtoc32.

   Instead of this single-byte code:

      char *p = ..., *lim = ...;
      for (; p < lim; p++)
        process (*p);

   You can use this multi-byte code:

      char *p = ..., *lim = ...;
      for (mbcel_t g; p < lim; p += g.len)
        {
	  g = mbcel_scan (p, lim);
	  process (g);
	}

   You can select from G using G.ch, G.err, and G.len.

   Although ISO C and POSIX allow encodings that have shift states or
   that can produce multiple characters from an indivisible byte sequence,
   POSIX does not require support for these encodings,
   they are not in practical use on GNUish platforms,
   and omitting support for them simplifies the API.  */

#ifndef _MBCEL_H
#define _MBCEL_H 1

/* This file uses _GL_INLINE_HEADER_BEGIN, _GL_INLINE.  */
#if !_GL_CONFIG_H_INCLUDED
 #error "Please include config.h first."
#endif

#include <uchar.h>

/* mbcel_t is a type representing a character CH or an encoding error byte ERR,
   along with a count of the LEN bytes that represent CH or ERR.
   If ERR is zero, CH is a valid character and 1 <= LEN <= MB_LEN_MAX;
   otherwise ERR is an encoding error byte, 0x80 <= ERR <= UCHAR_MAX,
   CH == 0, and LEN == 1.  */
typedef struct
{
  char32_t ch;
  unsigned char err;
  unsigned char len;
} mbcel_t;

/* Pacify GCC re '*p <= 0x7f' below.  */
#if defined __GNUC__ && 4 < __GNUC__ + (3 <= __GNUC_MINOR__)
# pragma GCC diagnostic ignored "-Wtype-limits"
#endif

_GL_INLINE_HEADER_BEGIN
#ifndef MBCEL_INLINE
# define MBCEL_INLINE _GL_INLINE
#endif

/* With diffutils there is no need for the performance overhead of
   replacing glibc mbrtoc32, as it doesn't matter whether the C locale
   treats bytes with the high bit set as encoding errors.  */
#ifdef __GLIBC__
# undef mbrtoc32
#endif

/* Scan bytes from P inclusive to LIM exclusive.  P must be less than LIM.
   Return either the representation of the valid character starting at P,
   or the representation of an encoding error of length 1 at P.  */
MBCEL_INLINE mbcel_t
mbcel_scan (char const *p, char const *lim)
{
  /* Handle ASCII quickly to avoid the overhead of calling mbrtoc32.
     In supported encodings, the first byte of a multi-byte character
     cannot be an ASCII byte.  */
  if (0 <= *p && *p <= 0x7f)
    return (mbcel_t) { .ch = *p, .len = 1 };

  char32_t ch;
  mbstate_t mbs = {0};
  size_t len = mbrtoc32 (&ch, p, lim - p, &mbs);

  /* Any LEN with top bit set is an encoding error, as LEN == (size_t) -3
     is not supported and MB_LEN_MAX <= (size_t) -1 / 2 on all platforms.  */
  if ((size_t) -1 / 2 < len)
    return (mbcel_t) { .err = *p, .len = 1 };

  /* A multi-byte character.  LEN must be positive,
     as *P != '\0' and shift sequences are not supported.  */
  return (mbcel_t) { .ch = ch, .len = len };
}

_GL_INLINE_HEADER_END

#endif /* _MBCEL_H */
