mbcel module for Gnulib?

Paul Eggert Sun, 09 Jul 2023 02:21:55 -0700

GNU diffutils has long lacked support for multi-byte locales for options like --ignore-case (-i) and --ignore-space-change (-b), and the recent char32_t changes to diffutils/src/side.c inspired me to fix this. As it was a pain for diffutils to use mbrtoc32 directly, I looked into using Gnulib's mbiter module to iterate through diffutils input. However, mbiter's generality had a performance penalty.

Some of the performance penalty is due to Gnulib's mbrtoc32 module replacing mbrtoc32 on glibc. As I understand it, this is due to glibc's mishandling of the C locale (it treats non-ASCII bytes as encoding errors). Such a bug should not affect diffutils, as diffutils uses mbrtoc32 only in multi-byte locales. So I'd like a way for diffutils to use the mbrtoc32 module without replacing mbrtoc32 on glibc. In the patch I just installed into diffutils on Savannah, this is done via a conditional "#undef mbrtoc32" (see attached) but this is a hack and there should be a better way.

More of the performance penalty appears to come from the mbiter module's support for arbitrary character encodings that don't happen in practice, or at least if they do happen they're so rare that diffutils need not worry about them. To work around this problem I wrote a simple, fast iterator "mbcel" that I hope works on all the platforms Gnulib normally targets. mbcel uses a functional style (that is, its only function mbcel_scan is pure in the GCC sense, with no side effects), and this should help make calling code clearer (and I hope, more efficient).

I timed mbcel on the Emacs source code and it scanned the input significantly faster than mbiter did. So I installed it into diffutils on Savannah, as part of diffutils's new support for multi-byte locales.

I'm thinking that mbcel would be useful in Gnulib and in other GNU programs, and that we should create an mbcel module for it in Gnulib. A copy of its only file lib/mbcel.h is attached. The idea is to have an option that is simple and fast, albeit not portable to theoretical platforms.
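
To give a feel for the calling style, here is a minimal sketch of a caller that tallies characters and encoding errors in a buffer. It is not code from diffutils: cel_t and cel_scan are hypothetical stand-ins that mirror mbcel_t and mbcel_scan so that the sketch compiles without Gnulib's config.h, tally and tally_t are made up for illustration, and the sketch assumes a platform with a C11 <uchar.h> that provides mbrtoc32.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical stand-in for mbcel_t: a character CH or an encoding
   error byte ERR, plus the number LEN of bytes consumed.  */
typedef struct
{
  char32_t ch;
  unsigned char err;
  unsigned char len;
} cel_t;

/* Hypothetical stand-in for mbcel_scan, calling the platform's
   mbrtoc32 directly.  P must be less than LIM.  */
static cel_t
cel_scan (char const *p, char const *lim)
{
  assert (p < lim);

  /* Handle ASCII without calling mbrtoc32.  */
  unsigned char c = *p;
  if (c <= 0x7f)
    return (cel_t) { .ch = c, .len = 1 };

  char32_t ch;
  mbstate_t mbs;
  memset (&mbs, 0, sizeof mbs);
  size_t len = mbrtoc32 (&ch, p, lim - p, &mbs);

  /* (size_t) -1 and (size_t) -2 both count as a one-byte error.  */
  if ((size_t) -1 / 2 < len)
    return (cel_t) { .err = c, .len = 1 };

  return (cel_t) { .ch = ch, .len = len };
}

/* Illustrative result type: counts of valid characters and of
   encoding error bytes.  */
typedef struct
{
  size_t chars;
  size_t errors;
} tally_t;

/* Walk the N bytes at BUF one scanned element at a time, in the
   current locale's encoding.  */
static tally_t
tally (char const *buf, size_t n)
{
  tally_t t = { 0, 0 };
  char const *p = buf, *lim = buf + n;
  for (cel_t g; p < lim; p += g.len)
    {
      g = cel_scan (p, lim);
      if (g.err)
        t.errors++;
      else
        t.chars++;
    }
  return t;
}
```

Because the scanner keeps no state between calls, the loop needs only the pointer P and the returned value; how a byte with the high bit set is classified depends on the locale in effect when tally is called.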

/* Multi-byte characters, error encodings, and lengths
   Copyright 2023 Free Software Foundation, Inc.

   This file is free software: you can redistribute it and/or modify
   it under the terms of the GNU Lesser General Public License as
   published by the Free Software Foundation; either version 2.1 of the
   License, or (at your option) any later version.

   This file is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public License
   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */

/* Written by Paul Eggert.  */

/* The mbcel_scan function lets code iterate through an array of bytes,
   supporting character encodings in practical use
   more simply than using plain mbrtoc32.

   Instead of this single-byte code:

      char *p = ..., *lim = ...;
      for (; p < lim; p++)
        process (*p);

   You can use this multi-byte code:

      char *p = ..., *lim = ...;
      for (mbcel_t g; p < lim; p += g.len)
        {
	  g = mbcel_scan (p, lim);
	  process (g);
	}

   You can select from G using G.ch, G.err, and G.len.

   Although ISO C and POSIX allow encodings that have shift states or
   that can produce multiple characters from an indivisible byte sequence,
   POSIX does not require support for these encodings,
   they are not in practical use on GNUish platforms,
   and omitting support for them simplifies the API.  */

#ifndef _MBCEL_H
#define _MBCEL_H 1

/* This file uses _GL_INLINE_HEADER_BEGIN, _GL_INLINE.  */
#if !_GL_CONFIG_H_INCLUDED
 #error "Please include config.h first."
#endif

#include <uchar.h>

/* mbcel_t is a type representing a character CH or an encoding error byte ERR,
   along with a count of the LEN bytes that represent CH or ERR.
   If ERR is zero, CH is a valid character and 1 <= LEN <= MB_LEN_MAX;
   otherwise ERR is an encoding error byte, 0x80 <= ERR <= UCHAR_MAX,
   CH == 0, and LEN == 1.  */
typedef struct
{
  char32_t ch;
  unsigned char err;
  unsigned char len;
} mbcel_t;

/* Pacify GCC re '*p <= 0x7f' below.  */
#if defined __GNUC__ && 4 < __GNUC__ + (3 <= __GNUC_MINOR__)
# pragma GCC diagnostic ignored "-Wtype-limits"
#endif

_GL_INLINE_HEADER_BEGIN
#ifndef MBCEL_INLINE
# define MBCEL_INLINE _GL_INLINE
#endif

/* With diffutils there is no need for the performance overhead of
   replacing glibc mbrtoc32, as it doesn't matter whether the C locale
   treats bytes with the high bit set as encoding errors.  */
#ifdef __GLIBC__
# undef mbrtoc32
#endif

/* Scan bytes from P inclusive to LIM exclusive.  P must be less than LIM.
   Return either the representation of the valid character starting at P,
   or the representation of an encoding error of length 1 at P.  */
MBCEL_INLINE mbcel_t
mbcel_scan (char const *p, char const *lim)
{
  /* Handle ASCII quickly to avoid the overhead of calling mbrtoc32.
     In supported encodings, the first byte of a multi-byte character
     cannot be an ASCII byte.  */
  if (0 <= *p && *p <= 0x7f)
    return (mbcel_t) { .ch = *p, .len = 1 };

  char32_t ch;
  mbstate_t mbs = {0};
  size_t len = mbrtoc32 (&ch, p, lim - p, &mbs);

  /* Any LEN with top bit set is an encoding error, as LEN == (size_t) -3
     is not supported and MB_LEN_MAX <= (size_t) -1 / 2 on all platforms.  */
  if ((size_t) -1 / 2 < len)
    return (mbcel_t) { .err = *p, .len = 1 };

  /* A multi-byte character.  LEN must be positive,
     as *P != '\0' and shift sequences are not supported.  */
  return (mbcel_t) { .ch = ch, .len = len };
}

_GL_INLINE_HEADER_END

#endif /* _MBCEL_H */
