On 2023-07-20 08:28, Bruno Haible wrote:
e.g., it worries about mbrtoc32 returning (size_t) -3,
This makes only for a small performance difference.
Yes, if done right, it matters only when input contains encoding errors.
or returning (size_t) -1 in the C locale.
Indeed, this shows as a difference between mbiterf and mbcel in the
test cases c, f:
mbiterf mbcel mbuiterf
c 1.145 0.670 1.179
f 13.028 5.714 14.654
But since the glibc people are already working on resolving this issue,
Although that'll help quite a bit, it won't be enough to overcome the
performance overhead. mbiter, mbiterf, mbuiter, and mbuiterf (which I'll
shorten to "mbu?iterf?") carry around more stuff, and it'll likely be
significant overhead even after the issue is resolved. For now, anyway,
the overhead is large enough that mbcel is a winner for diff.
Even mbcel strains GCC's capabilities to optimize. When I look at the
x86-64 code it generates I see clear opportunities for tighter code. If
I have time I'll fire off a suggestion or two to the GCC maintainers.
With luck these suggestions would also help mbu?iterf?.
The mbu?iterf? overhead would be OK if it bought something for diff. But
it doesn't. And I suspect we'll see something similar in other GNU code
that has held off on full support for multi-byte encodings due to
worries about complexity and performance.
The issue here is not just performance, but complexity. See the attached
patch, which I just installed into diffutils. It has a function
mbcel_strcasecmp which has the same behavior as Gnulib's mbuiterf-using
mbscasecmp, and with multi-byte source code that is significantly
simpler than the mbuiterf code. Here's the mbcel version:
while (true)
{
mbcel_t g1 = mbcel_scanz (p1); p1 += g1.len;
mbcel_t g2 = mbcel_scanz (p2); p2 += g2.len;
int cmp = mbcel_casecmp (g1, g2);
if (_GL_UNLIKELY (cmp | ! (g1.ch | g1.err)))
return cmp;
}
and the mbuiterf equivalent:
mbuif_state_t state1;
const char *iter1;
mbuif_init (state1);
iter1 = s1;
mbuif_state_t state2;
const char *iter2;
mbuif_init (state2);
iter2 = s2;
while (mbuif_avail (state1, iter1) && mbuif_avail (state2, iter2))
{
mbchar_t cur1 = mbuif_next (state1, iter1);
mbchar_t cur2 = mbuif_next (state2, iter2);
int cmp = mb_casecmp (cur1, cur2);
if (cmp != 0)
return cmp;
iter1 += mb_len (cur1);
iter2 += mb_len (cur2);
}
if (mbuif_avail (state1, iter1))
/* s2 terminated before s1. */
return 1;
if (mbuif_avail (state2, iter2))
/* s1 terminated before s2. */
return -1;
return 0;
On my platform the machine code for mbcel_strcasecmp (including its
single-byte code) is 676 bytes, whereas mbscasecmp's contains 1230 bytes
(1055 bytes if compiled with NDEBUG). And here is the list of functions
that the two implementations call, along with the static number of calls
(C for mbcel, I for mbuiterf, J for mbuiterf with NDEBUG):
C I J
3 __assert_fail
1 3 3 __ctype_get_mb_cur_max
1 1 1 __ctype_to_lower_loc
2 3 3 mbrtoc32 / rpl_mbrtoc32
4 2 mbsinit
3 3 memcmp
2 2 strlen
3 3 strnlen1
2 2 2 towlower
Although one could tune the mbuiterf version to make it simpler and
faster (and I'm mentioning the above functions to help that effort, if
you want to take the time), I don't see how it ever could match the
mcbel version, given its more-complex API.
Since the two versions have the same behavior and the mbcel version is
simpler and faster, I'd like to replace mbscasecmp's mbuiterf version
with the mbcel version in Gnulib. If that can't be done for some reason,
then let's have a switch to let the Gnulib user pick one implementation
or the other.
The other significant difference that I see is the handling of multibyte
sequences. When there 2 or 3 bytes (of, say, UTF-8) that constitute an
incomplete multibyte character at the end of the string,
This isn't a problem for programs like grep and diff, where there's
always a newline at the end the input buffer.
I disagree: Any program can run into it when the input is
<some valid UTF-8 characters><an incomplete UTF-8 character><newline>
The newline is part of the string - at least, that's how grep and diff
do things.
My screenshot from the 'src/diff -y -t' output in an xterm also shows
that there is an issue.
That's an issue of display; it's not an issue of how grep scans or
outputs the data. In code that simply scans, this issue does not matter.
For example, mbscasecmp behaves the same regardless of whether mbcel or
mbuiterf is used.
Sure. Some programs then treat that error as if an U+FFFD character
had been read.
- ISO 10646 says ([1] section R.7) "it shall interpret that malformed
sequence in the same way that it interprets a character that is outside
the adopted subset".
If I understand this requirement correctly mbcel satisfies it, as mbcel
treats those two things in the same way, namely, as sequences of
encoding error bytes.
No, I don't think mbcel satisfies it, since mbcel interprets the
"malformed sequence" not like "a character" but like multiple characters.
No, mbcel interprets it as a malformed byte sequence. It's not a
character, nor is it multiple characters. It is a sequence of encoding
error bytes, and it's up to the caller to deal with that sequence as it
chooses.
diff uses mbcel to output the malformed sequence as-is. That's how diff
signals encoding errors. This is common practice, not just for diff but
for many other applications. Of course this method has issues, just like
any method of signaling encoding errors does, but it's simple, it
doesn't lose information, it's a local sweet spot in how to deal with
encoding errors, and I don't see any serious suggestion to change how
diff behaves in this respect. Since mbcel supports this usage, diff
doesn't need the extra features provided by mbu?iterf?.
The Unicode Standard has several pages about this topic:
Unicode 15.0 section 3.9
https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf pages 124..129
That section is about converters. mbcel and mbiter are not converters:
they're scanners. You can use mbcel (or mbiter) to build a converter
that conforms to the standard in any way you like. You can also use
either of them in code that does not convert (e.g., mbscasecmp). So this
disagreement is not about mbcel itself, but how mbcel is used as part of
a larger application.
diff also is not a converter in the sense of the standard. It does not
convert UTF-8 text to a different encoding.
Also, the standard explicitly allows converters to treat each
encoding-error byte as a separate unit. It gives this as an example on
page 127, where it says "... in processing the UTF-8 code unit sequence
<F0 80 80 41>.... The converter could return ... <U+FFFD, U+FFFD,
U+FFFD, U+0041>, handling each byte of <F0 80 80> as a separate error"
among other possible approaches.
(xterm does it right, but you are right that nowadays gnome-terminal and other
vte-based terminal emulators are the majority.)
xterm does not do it "right", at least, not on my platform (Ubuntu
23.04). It does not behave as Kuhn suggested in the last screenful of
Kuhn's example, as I described in a previous email.
If you were to use the 'mbiterf' module instead of mbcel, the
mb_equal macro from mbchar.h does the right thing. Yes, an mb_equal
call is a bit more complicated than the same_ch_err definition that
you have in diffutils/src/io.c. That's the unavoidable consequence of
treating a sequence of 2 or 3 bytes as *one* error.
OK, thanks for letting me know mbiterf can handle the situation.
What does mbiterf do in non-UTF-8 multi-byte locales? How can it tell
how long the invalid sequence is?
According to what I read in the Unicode Standard (above), it's a best practice
for all kinds of applications.
The Unicode standard doesn't say it's best practice. And even if it were
best practice for UTF-8 (which is dubious) I don't see see how it
generalizes to non-UTF-8 multi-byte encodings.
Admittedly these encodings are less important. I had even considered
dropping support for them in GNU diff, as that would simplify
maintenance and improve performance. However, that might be a bridge too
far for a few traditionalist users.
From 6477dce501741b941dd695406ce04c8494887a54 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Fri, 21 Jul 2023 11:20:46 -0700
Subject: [PATCH] diff: sort multi-byte file names better
* bootstrap.conf (gnulib_modules): Add builtin-expect.
* lib/mbcel-strcasecmp.c: New file.
* lib/Makefile.am (libdiffutils_a_SOURCES): Add it.
* lib/mbcel.h (MBCEL_LEN_MAX, MBCEL_ENCODING_ERROR_SHIFT)
(MBCEL_UCHAR_FITS, MBCEL_UCHAR_EASILY_FITS): New constants.
(_GL_LIKELY): New macro.
(mbcel_scan): Use it. Simplify NetBSD code.
(mbcel_scant, mbcel_scanz, mbcel_cmp, mbcel_casecmp): New functions.
* src/dir.c (strcasecoll): Move defn here from system.h,
since only dir.c needs it. Use mbcel_strcasecmp instead
of strcasecmp.
---
NEWS | 6 +-
bootstrap.conf | 1 +
lib/Makefile.am | 2 +-
lib/mbcel-strcasecmp.c | 63 ++++++++++++++++++
lib/mbcel.h | 145 ++++++++++++++++++++++++++++++++++++-----
src/dir.c | 9 +++
src/system.h | 13 +---
7 files changed, 205 insertions(+), 34 deletions(-)
create mode 100644 lib/mbcel-strcasecmp.c
diff --git a/NEWS b/NEWS
index 192500d..195b630 100644
--- a/NEWS
+++ b/NEWS
@@ -4,9 +4,9 @@ GNU diffutils NEWS -*- outline -*-
** Improvements
- diff's --ignore-case (-i) option now supports multi-byte characters.
- For example, it treats Greek capital Δ like small δ when input
- uses UTF-8.
+ diff's --ignore-case (-i) and --ignore-file-name-case options now
+ support multi-byte characters. For example, they treat Greek
+ capital Δ like small δ when input uses UTF-8.
diff now supports multi-byte characters when treating white space.
In options like --ignore-space-change (-b) and
diff --git a/bootstrap.conf b/bootstrap.conf
index 98f94cf..5f50700 100644
--- a/bootstrap.conf
+++ b/bootstrap.conf
@@ -28,6 +28,7 @@ argmatch
assert-h
attribute
binary-io
+builtin-expect
c-ctype
c-stack
c32isspace
diff --git a/lib/Makefile.am b/lib/Makefile.am
index ef15533..e252902 100644
--- a/lib/Makefile.am
+++ b/lib/Makefile.am
@@ -30,6 +30,6 @@ noinst_HEADERS =
include gnulib.mk
noinst_HEADERS += cmpbuf.h mbcel.h
-libdiffutils_a_SOURCES += cmpbuf.c mbcel.c
+libdiffutils_a_SOURCES += cmpbuf.c mbcel.c mbcel-strcasecmp.c
AM_CFLAGS += $(GNULIB_WARN_CFLAGS) $(WERROR_CFLAGS)
diff --git a/lib/mbcel-strcasecmp.c b/lib/mbcel-strcasecmp.c
new file mode 100644
index 0000000..5d50491
--- /dev/null
+++ b/lib/mbcel-strcasecmp.c
@@ -0,0 +1,63 @@
+/* Case-insensitive string comparison function.
+ Copyright 2023 Free Software Foundation, Inc.
+
+ This file is free software: you can redistribute it and/or modify
+ it under the terms of the GNU Lesser General Public License as
+ published by the Free Software Foundation, either version 3 of the
+ License, or (at your option) any later version.
+
+ This file is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public License
+ along with this program. If not, see <https://www.gnu.org/licenses/>. */
+
+/* Written by Paul Eggert. */
+
+#include <config.h>
+
+/* Specification. */
+#include <mbcel.h>
+
+#include <ctype.h>
+#include <stdlib.h>
+
+/* Compare the multi-byte strings S1 and S2 lexicographically, ignoring case.
+ Return <0, 0, >0 for <, =, >. Consider encoding errors to be
+ greater than characters and compare them byte by byte. */
+
+int
+mbcel_strcasecmp (char const *s1, char const *s2)
+{
+ char const *p1 = s1;
+ char const *p2 = s2;
+
+ /* Do not look at the entire extent of S1 or S2 until needed:
+ when two strings differ, the difference is typically early. */
+ if (MB_CUR_MAX == 1)
+ while (true)
+ {
+ unsigned char c1 = *p1++;
+ unsigned char c2 = *p2++;
+ int cmp = MBCEL_UCHAR_FITS ? c1 - c2 : _GL_CMP (c1, c2);
+ if (_GL_UNLIKELY (cmp))
+ {
+ c1 = tolower (c1);
+ c2 = tolower (c2);
+ cmp = MBCEL_UCHAR_FITS ? c1 - c2 : _GL_CMP (c1, c2);
+ }
+ if (_GL_UNLIKELY (cmp | !c1))
+ return cmp;
+ }
+ else
+ while (true)
+ {
+ mbcel_t g1 = mbcel_scanz (p1); p1 += g1.len;
+ mbcel_t g2 = mbcel_scanz (p2); p2 += g2.len;
+ int cmp = mbcel_casecmp (g1, g2);
+ if (_GL_UNLIKELY (cmp | ! (g1.ch | g1.err)))
+ return cmp;
+ }
+}
diff --git a/lib/mbcel.h b/lib/mbcel.h
index 69f4d5a..001c1c0 100644
--- a/lib/mbcel.h
+++ b/lib/mbcel.h
@@ -37,6 +37,34 @@
You can select from G using G.ch, G.err, and G.len.
+ The mbcel_scanz function is similar except it works with a
+ string of unknown length that is terminated with '\0'.
+ Instead of this single-byte code:
+
+ char *p = ...;
+ for (; *p; p++)
+ process (*p);
+
+ You can use this multi-byte code:
+
+ char *p = ...;
+ for (mbcel_t g; *p; p += g.len)
+ {
+ g = mbcel_scanz (p);
+ process (g);
+ }
+
+ mbcel_scant (P, TERMINATOR) is like mbcel_scanz (P) except the
+ string is terminated by TERMINATOR. The TERMINATORs '\0', '\r',
+ '\n', '.', '/' are safe, as they cannot be a part (even a trailing
+ byte) of a multi-byte character.
+
+ mbcel_cmp (G1, G2) and mbcel_casecmp (G1, G2) compare two mbcel_t
+ values lexicographically by character or by encoding byte value,
+ with encoding bytes sorting after characters. mbcel_casecmp
+ ignores case in characters. mbcel_strcasecmp compares two
+ null-terminated strings lexicographically.
+
Although ISO C and POSIX allow encodings that have shift states or
that can produce multiple characters from an indivisible byte sequence,
POSIX does not require support for these encodings,
@@ -56,6 +84,14 @@
#include <stddef.h>
#include <uchar.h>
+/* The maximum multibyte character length supported on any platform.
+ This can be less than MB_LEN_MAX because many platforms have a
+ large MB_LEN_MAX to allow for stateful encodings, and mbcel does
+ not need to support these encodings. MBCEL_LEN_MAX is enough for
+ UTF-8, EUC, Shift-JIS, GB18030, etc.
+ 0 < MB_CUR_MAX <= MBCEL_LEN_MAX <= MB_LEN_MAX. */
+enum { MBCEL_LEN_MAX = MB_LEN_MAX < 4 ? MB_LEN_MAX : 4 };
+
/* mbcel_t is a type representing a character CH or an encoding error byte ERR,
along with a count of the LEN bytes that represent CH or ERR.
If ERR is zero, CH is a valid character and 1 <= LEN <= MB_LEN_MAX;
@@ -82,23 +118,42 @@ _GL_INLINE_HEADER_BEGIN
# define MBCEL_INLINE _GL_INLINE
#endif
-/* With diffutils there is no need for the performance overhead of
- replacing glibc mbrtoc32, as it doesn't matter whether the C locale
- treats bytes with the high bit set as encoding errors. */
+/* With mbcel there should be no need for the performance overhead of
+ replacing glibc mbrtoc32, as callers shouldn't care whether the
+ C locale treats a byte with the high bit set as an encoding error. */
#ifdef __GLIBC__
# undef mbrtoc32
#endif
+/* Shifting an encoding error byte (which must be at least 2**7)
+ left by 14 yields at least 2**21 (0x200000), which is greater
+ than the maximum Unicode value 0x10FFFF. This suffices to sort
+ encoding errors after characters. */
+enum { MBCEL_ENCODING_ERROR_SHIFT = 14 };
+
+/* In the typical case where unsigned char easily fits in int,
+ optimizations are possible. */
+enum {
+ MBCEL_UCHAR_FITS = UCHAR_MAX <= INT_MAX,
+ MBCEL_UCHAR_EASILY_FITS = UCHAR_MAX <= INT_MAX >> MBCEL_ENCODING_ERROR_SHIFT
+};
+
+#ifndef _GL_LIKELY
+/* Rely on __builtin_expect, as provided by the module 'builtin-expect'. */
+# define _GL_LIKELY(cond) __builtin_expect ((cond), 1)
+# define _GL_UNLIKELY(cond) __builtin_expect ((cond), 0)
+#endif
+
/* Scan bytes from P inclusive to LIM exclusive. P must be less than LIM.
- Return either the representation of the valid character starting at P,
- or the representation of an encoding error of length 1 at P. */
+ Return either the valid character starting at P,
+ or the encoding error of length 1 at P. */
MBCEL_INLINE mbcel_t
mbcel_scan (char const *p, char const *lim)
{
/* Handle ASCII quickly to avoid the overhead of calling mbrtoc32.
In supported encodings, the first byte of a multi-byte character
cannot be an ASCII byte. */
- if (0 <= *p && *p <= 0x7f)
+ if (_GL_LIKELY (0 <= *p && *p <= 0x7f))
return (mbcel_t) { .ch = *p, .len = 1 };
/* An initial mbstate_t; initialization optimized for some platforms.
@@ -117,16 +172,11 @@ mbcel_scan (char const *p, char const *lim)
u.s.ch = u.s.utf8_want = u.s.euc_want = 0;
# define mbs u.m
#elif defined __NetBSD__
- /* Like FreeBSD, but mbstate_t starts with a pointer and
- (before joerg commit dated 2020-06-02 01:30:31 +0000)
- up to another pointer's worth of padding. */
- struct mbhidden {
- struct _RuneLocale *p[2];
- char32_t ch; int utf8_want, euc_want;
- } _GL_ATTRIBUTE_MAY_ALIAS;
+ /* Experiments on both 32- and 64-bit NetBSD platforms have
+ shown that it doesn't work to clear fewer than 24 bytes. */
+ struct mbhidden { long long int a, b, c; } _GL_ATTRIBUTE_MAY_ALIAS;
union { mbstate_t m; struct mbhidden s; } u;
- u.s.p[0] = u.s.p[1] = nullptr;
- u.s.ch = u.s.utf8_want = u.s.euc_want = 0;
+ u.s.a = u.s.b = u.s.c = 0;
# define mbs u.m
#else
/* mbstate_t has unknown structure or is not worth optimizing. */
@@ -137,9 +187,8 @@ mbcel_scan (char const *p, char const *lim)
size_t len = mbrtoc32 (&ch, p, lim - p, &mbs);
/* Any LEN with top bit set is an encoding error, as LEN == (size_t) -3
- is not supported and MB_LEN_MAX <= (size_t) -1 / 2 on all platforms. */
- static_assert (MB_LEN_MAX <= (size_t) -1 / 2);
- if ((size_t) -1 / 2 < len)
+ is not supported and MB_LEN_MAX is small. */
+ if (_GL_UNLIKELY ((size_t) -1 / 2 < len))
return (mbcel_t) { .err = *p, .len = 1 };
/* Tell the compiler LEN is at most MB_LEN_MAX,
@@ -152,6 +201,66 @@ mbcel_scan (char const *p, char const *lim)
return (mbcel_t) { .ch = ch, .len = len };
}
+/* Scan bytes from P, a byte sequence terminated by TERMINATOR.
+ If *P == TERMINATOR, scan just that byte; otherwise scan
+ bytes up to but not including a TERMINATOR byte.
+ TERMINATOR must be ASCII, and should be '\0', '\r', '\n', '.', or '/'.
+ Return either the valid character starting at P,
+ or the encoding error of length 1 at P. */
+MBCEL_INLINE mbcel_t
+mbcel_scant (char const *p, char terminator)
+{
+ /* Handle ASCII quickly for speed. */
+ if (_GL_LIKELY (0 <= *p && *p <= 0x7f))
+ return (mbcel_t) { .ch = *p, .len = 1 };
+
+ /* Defer to mbcel_scan for non-ASCII. Compute length with code that
+ is typically branch-free and faster than memchr or strnlen. */
+ char const *lim = p + 1;
+ for (int i = 0; i < MBCEL_LEN_MAX - 1; i++)
+ lim += *lim != terminator;
+ return mbcel_scan (p, lim);
+}
+
+/* Scan bytes from P, a byte sequence terminated by '\0'.
+ If *P == '\0', scan just that byte; otherwise scan
+ bytes up to but not including a '\0'.
+ Return either the valid character starting at P,
+ or the encoding error of length 1 at P. */
+MBCEL_INLINE mbcel_t
+mbcel_scanz (char const *p)
+{
+ return mbcel_scant (p, '\0');
+}
+
+/* Compare G1 and G2, with encoding errors sorting after characters.
+ Return <0, 0, >0 for <, =, >. */
+MBCEL_INLINE int
+mbcel_cmp (mbcel_t g1, mbcel_t g2)
+{
+ int c1 = g1.ch, c2 = g2.ch, e1 = g1.err, e2 = g2.err, ccmp = c1 - c2,
+ ecmp = MBCEL_UCHAR_EASILY_FITS ? e1 - e2 : _GL_CMP (e1, e2);
+ return (ecmp << MBCEL_ENCODING_ERROR_SHIFT) + ccmp;
+}
+
+/* Compare G1 and G2 ignoring case, with encoding errors sorting after
+ characters. Return <0, 0, >0 for <, =, >. */
+MBCEL_INLINE int
+mbcel_casecmp (mbcel_t g1, mbcel_t g2)
+{
+ int cmp = mbcel_cmp (g1, g2);
+ if (_GL_LIKELY (g1.err | g2.err | !cmp))
+ return cmp;
+ int c1 = c32tolower (g1.ch);
+ int c2 = c32tolower (g2.ch);
+ return c1 - c2;
+}
+
+/* Compare the multi-byte strings S1 and S2 lexicographically, ignoring case.
+ Return <0, 0, >0 for <, =, >. Consider encoding errors to be
+ greater than characters and compare them byte by byte. */
+int mbcel_strcasecmp (char const *s1, char const *s2);
+
_GL_INLINE_HEADER_END
#endif /* _MBCEL_H */
diff --git a/src/dir.c b/src/dir.c
index 0ba38b7..99392a5 100644
--- a/src/dir.c
+++ b/src/dir.c
@@ -26,6 +26,15 @@
#include <setjmp.h>
#include <xalloc.h>
+#if ! HAVE_STRCASECOLL
+# if HAVE_STRICOLL || defined stricoll
+# define strcasecoll(a, b) stricoll (a, b)
+# else
+# include <mbcel.h>
+# define strcasecoll(a, b) mbcel_strcasecmp (a, b) /* best we can do */
+# endif
+#endif
+
/* A sorted vector of file names obtained by reading a directory. */
struct dirdata
diff --git a/src/system.h b/src/system.h
index 71c9d4c..ca4d27d 100644
--- a/src/system.h
+++ b/src/system.h
@@ -51,23 +51,12 @@
#include <stdlib.h>
#define EXIT_TROUBLE 2
+#include <inttypes.h>
#include <limits.h>
#include <locale.h>
#include <stdckdint.h>
#include <stddef.h>
-#include <inttypes.h>
-
#include <string.h>
-#if ! HAVE_STRCASECOLL
-# if HAVE_STRICOLL || defined stricoll
-# define strcasecoll(a, b) stricoll (a, b)
-# else
-# define strcasecoll(a, b) strcasecmp (a, b) /* best we can do */
-# endif
-#endif
-#if ! (HAVE_STRCASECMP || defined strcasecmp)
-int strcasecmp (char const *, char const *);
-#endif
#include <gettext.h>
#if ! ENABLE_NLS
--
2.39.2