Hi Paul,

> >> Although I'm sure mbiter can be improved I
> >> don't see how it could catch up to mbcel so long as it continues to
> >> solve a harder problem than mbcel solves.
> >
> > I don't know exactly what you mean by "harder problem".
>
> I meant that it solves a harder porting problem because it worries about
> more issues, e.g., it worries about mbrtoc32 returning (size_t) -3,
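A return of (size_t) -3 means that mbrtoc32 has delivered another
char32_t from its internal conversion state, without consuming any input
bytes. For illustration (this is a sketch of mine, not the actual mbiter
code), a decode loop that copes with all the mbrtoc32 return values looks
roughly like this:

  #include <string.h>
  #include <uchar.h>

  void
  process (const char *s, size_t n, void (*emit) (char32_t))
  {
    mbstate_t state;
    size_t i = 0;
    memset (&state, 0, sizeof state);
    while (i < n)
      {
        char32_t c;
        size_t ret = mbrtoc32 (&c, s + i, n - i, &state);
        if (ret == (size_t) -1)
          {
            /* Invalid bytes.  What exactly to emit here is the policy
               question discussed further down.  */
            emit (0xFFFD);
            i++;
            memset (&state, 0, sizeof state);
          }
        else if (ret == (size_t) -2)
          /* Incomplete sequence at the end of the buffer.  */
          break;
        else if (ret == (size_t) -3)
          /* Another char32_t came out of the conversion state;
             no input bytes were consumed.  */
          emit (c);
        else
          {
            /* ret bytes were consumed; ret == 0 means the null
               character, which occupies 1 byte in practice.  */
            emit (c);
            i += (ret == 0 ? 1 : ret);
          }
      }
  }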
This makes for only a small performance difference. I could measure it by
doing the benchmarks of

  mbiterf-bench-tests mbuiterf-bench-tests

versus

  mbiterf-bench-tests mbuiterf-bench-tests mbrtoc32-regular

In the latter case, the lines marked with #if !GNULIB_MBRTOC32_REGULAR
are optimized away.

> or returning (size_t) -1 in the C locale.

Indeed, this shows as a difference between mbiterf and mbcel in the test
cases c, f:

        mbiterf    mbcel    mbuiterf
   c      1.145    0.670       1.179
   f     13.028    5.714      14.654

But since the glibc people are already working on resolving this issue,
I won't spend time optimizing it one way or the other.

> > The other significant difference that I see is the handling of multibyte
> > sequences. When there are 2 or 3 bytes (of, say, UTF-8) that constitute
> > an incomplete multibyte character at the end of the string,
>
> This isn't a problem for programs like grep and diff, where there's
> always a newline at the end of the input buffer.

I disagree: any program can run into it when the input is

  <some valid UTF-8 characters><an incomplete UTF-8 character><newline>

My screenshot from the 'src/diff -y -t' output in an xterm also shows
that there is an issue.

> > - mbcel returns each byte, one by one, as a character without a
> >   char32_t code.
>
> (A nit: it's not a character; it's an encoding error.)

Sure. Some programs then treat that error as if a U+FFFD character had
been read.

> > - ISO 10646 says ([1] section R.7) "it shall interpret that malformed
> >   sequence in the same way that it interprets a character that is
> >   outside the adopted subset".
>
> If I understand this requirement correctly mbcel satisfies it, as mbcel
> treats those two things in the same way, namely, as sequences of
> encoding error bytes.

No, I don't think mbcel satisfies it, since mbcel interprets the
"malformed sequence" not like "a character" but like multiple characters.

> > - Markus Kuhn's example ([2] section 3) has a section where
> >   "All bytes of an incomplete sequence should be signalled as a single
> >   malformed sequence, i.e., you should see only a single replacement
> >   character in each of the next 10 tests."
>
> Kuhn is talking about programs that display characters to users and that
> need some way to signal encoding errors. But diff is not such a program:
> it doesn't need to display a signal for an incomplete sequence, because
> it's not responsible for display.

Kuhn's writeup is generally about UTF-8 decoding. In the year 2000, when
it was written, the most important decoders were in the display engines
of terminal emulators. Nowadays, we have UTF-8 decoders in many, many
programs.

The Unicode Standard has several pages about this topic: Unicode 15.0
section 3.9
  https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
pages 124..129. It is also referenced by section 5.22
  https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf
page 255.

The relevant text starts at page 127:

  "U+FFFD Substitution of Maximal Subparts
   An increasing number of implementations are adopting the handling of
   ill-formed subsequences as specified in the W3C standard for encoding
   to achieve consistent U+FFFD replacements. ...
   Although the Unicode Standard does not require this practice for
   conformance ..."

See also table 3-11 on page 128.

So, clearly, this is not a *requirement* for a conforming UTF-8 decoder.
But the Unicode Standard's authors would not describe it at such great
length if it weren't a good practice.
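To make the difference concrete, here is a self-contained sketch of mine
(plain C, not Gnulib code; the byte values are my own example, not taken
from the Standard) that decodes UTF-8 with U+FFFD substitution of maximal
subparts:

  #include <stdio.h>

  /* Decode the maximal subpart starting at s (n > 0 bytes available).
     Return the number of bytes it occupies (at least 1) and store the
     code point, or 0xFFFD for an ill-formed subsequence, in *cp.
     Simplified from Unicode 15.0 section 3.9.  */
  static int
  decode (const unsigned char *s, int n, unsigned int *cp)
  {
    unsigned int c = s[0];
    unsigned int lo = 0x80, hi = 0xBF;
    int len, i;

    if (c < 0x80) { *cp = c; return 1; }
    else if (0xC2 <= c && c <= 0xDF) { c &= 0x1F; len = 2; }
    else if (0xE0 <= c && c <= 0xEF)
      {
        if (c == 0xE0) lo = 0xA0;           /* no overlong forms */
        if (c == 0xED) hi = 0x9F;           /* no surrogates */
        c &= 0x0F; len = 3;
      }
    else if (0xF0 <= c && c <= 0xF4)
      {
        if (c == 0xF0) lo = 0x90;           /* no overlong forms */
        if (c == 0xF4) hi = 0x8F;           /* nothing above U+10FFFF */
        c &= 0x07; len = 4;
      }
    else { *cp = 0xFFFD; return 1; }        /* lone invalid byte */

    for (i = 1; i < len; i++)
      {
        if (i >= n || s[i] < lo || s[i] > hi)
          {
            /* The bytes seen so far form ONE maximal subpart, so they
               are signalled as a single U+FFFD.  */
            *cp = 0xFFFD;
            return i;
          }
        c = (c << 6) | (s[i] & 0x3F);
        lo = 0x80; hi = 0xBF;
      }
    *cp = c;
    return len;
  }

  int
  main (void)
  {
    /* "ab", a truncated U+20AC (it needs three bytes), a newline.  */
    static const unsigned char input[] = "ab\xE2\x82\n";
    int n = sizeof input - 1;
    int i = 0;
    while (i < n)
      {
        unsigned int cp;
        i += decode (input + i, n - i, &cp);
        printf ("U+%04X\n", cp);
      }
    return 0;
  }

This prints U+0061 U+0062 U+FFFD U+000A: a single replacement character
for the two bytes \xE2\x82. Byte-by-byte reporting, as mbcel does it,
would produce two errors there instead of one.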
> It's certainly not
> typical practice in the GNU/Linux world. It's not true of the first five
> applications that I tested on Ubuntu 23.04: Emacs, Chrome, Firefox,
> less, and gnome-terminal.

Well, then I'll have to write a couple of QoI (quality of implementation)
reports... (xterm does it right, but you are right that nowadays
gnome-terminal and other vte-based terminal emulators are the majority.)

> Even if Kuhn's suggestion were good for display programs, programs like
> diff should not treat differing encoding error byte sequences as if they
> were equivalent. If two files A and B contain different encoding errors
> I expect most users would prefer "diff A B" to report the differences.

Sure. If you were to use the 'mbiterf' module instead of mbcel, the
mb_equal macro from mbchar.h does the right thing (see the sketch in the
P.S. below). Yes, an mb_equal call is a bit more complicated than the
same_ch_err definition that you have in diffutils/src/io.c. That's the
unavoidable consequence of treating a sequence of 2 or 3 bytes as *one*
error.

> There's not
> even a standard column width for U+FFFD itself: Kuhn recommends 1 in
> <https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c>, but 2 is more common in
> my experience.

Indeed, width considerations of strings with control characters are
hairy. And U+FFFD counts as a control character, according to
wcwidth(0xFFFD) == -1 (on glibc systems).

> > This may be acceptable as a corner case for 'diff'. But for a module
> > offered by Gnulib, we should IMO continue to follow the best practice
> > here.
>
> Although Kuhn's suggestion may be best practice for some applications,
> it's not best for applications like diff, and it would be helpful if
> Gnulib could support these applications.

According to what I read in the Unicode Standard (above), it's a best
practice for all kinds of applications.

I'm not asking you to rewrite the new code that you added in 'diff'. But
for other programs, from 'ls' to 'sed', I continue to think it would be a
good idea to follow that best practice.

Bruno
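P.S. To illustrate the mb_equal point: a multibyte-aware comparison loop
looks roughly like this. It is a sketch from memory, using the macros of
the classic 'mbiter' module (mbiterf's are analogous); check mbiter.h and
mbchar.h for the exact signatures before relying on it.

  #include <stdbool.h>
  #include <stddef.h>
  #include "mbiter.h"   /* Gnulib */
  #include "mbchar.h"   /* Gnulib */

  /* Compare two strings multibyte-character-wise, so that an ill-formed
     sequence is treated as one unit on each side, instead of being
     compared byte by byte.  */
  static bool
  mb_streq (const char *s1, size_t n1, const char *s2, size_t n2)
  {
    mbi_iterator_t i1, i2;
    mbi_init (i1, s1, n1);
    mbi_init (i2, s2, n2);
    while (mbi_avail (i1) && mbi_avail (i2))
      {
        if (!mb_equal (mbi_cur (i1), mbi_cur (i2)))
          return false;
        mbi_advance (i1);
        mbi_advance (i2);
      }
    /* Equal only if both strings are exhausted.  */
    return !mbi_avail (i1) && !mbi_avail (i2);
  }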