Hi Paul,

> >> Although I'm sure mbiter can be improved I
> >> don't see how it could catch up to mbcel so long as it continues to
> >> solve a harder problem than mbcel solves.
> >
> > I don't know exactly what you mean by "harder problem".
>
> I meant that it solves a harder porting problem because it worries about
> more issues, e.g., it worries about mbrtoc32 returning (size_t) -3,
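A return of (size_t) -3 means that mbrtoc32 has delivered another
char32_t from its internal conversion state, without consuming any input
bytes. For illustration (this is a sketch of mine, not the actual mbiter
code), a decode loop that copes with all the mbrtoc32 return values looks
roughly like this:

  #include <string.h>
  #include <uchar.h>

  void
  process (const char *s, size_t n, void (*emit) (char32_t))
  {
    mbstate_t state;
    size_t i = 0;
    memset (&state, 0, sizeof state);
    while (i < n)
      {
        char32_t c;
        size_t ret = mbrtoc32 (&c, s + i, n - i, &state);
        if (ret == (size_t) -1)
          {
            /* Invalid bytes.  What exactly to emit here is the policy
               question discussed further down.  */
            emit (0xFFFD);
            i++;
            memset (&state, 0, sizeof state);
          }
        else if (ret == (size_t) -2)
          /* Incomplete sequence at the end of the buffer.  */
          break;
        else if (ret == (size_t) -3)
          /* Another char32_t came out of the conversion state;
             no input bytes were consumed.  */
          emit (c);
        else
          {
            /* ret bytes were consumed; ret == 0 means the null
               character, which occupies 1 byte in practice.  */
            emit (c);
            i += (ret == 0 ? 1 : ret);
          }
      }
  }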
This makes for only a small performance difference. I could measure it by
doing the benchmarks of

  mbiterf-bench-tests mbuiterf-bench-tests

versus

  mbiterf-bench-tests mbuiterf-bench-tests mbrtoc32-regular

In the latter case, the lines marked with #if !GNULIB_MBRTOC32_REGULAR
are optimized away.

> or returning (size_t) -1 in the C locale.

Indeed, this shows as a difference between mbiterf and mbcel in the test
cases c, f:

        mbiterf    mbcel    mbuiterf
   c      1.145    0.670       1.179
   f     13.028    5.714      14.654

But since the glibc people are already working on resolving this issue,
I won't spend time optimizing it one way or the other.

> > The other significant difference that I see is the handling of multibyte
> > sequences. When there are 2 or 3 bytes (of, say, UTF-8) that constitute
> > an incomplete multibyte character at the end of the string,
>
> This isn't a problem for programs like grep and diff, where there's
> always a newline at the end of the input buffer.

I disagree: any program can run into it when the input is

  <some valid UTF-8 characters><an incomplete UTF-8 character><newline>

My screenshot from the 'src/diff -y -t' output in an xterm also shows
that there is an issue.

> > - mbcel returns each byte, one by one, as a character without a
> >   char32_t code.
>
> (A nit: it's not a character; it's an encoding error.)

Sure. Some programs then treat that error as if a U+FFFD character had
been read.

> > - ISO 10646 says ([1] section R.7) "it shall interpret that malformed
> >   sequence in the same way that it interprets a character that is
> >   outside the adopted subset".
>
> If I understand this requirement correctly mbcel satisfies it, as mbcel
> treats those two things in the same way, namely, as sequences of
> encoding error bytes.

No, I don't think mbcel satisfies it, since mbcel interprets the
"malformed sequence" not like "a character" but like multiple characters.

> > - Markus Kuhn's example ([2] section 3) has a section where
> >   "All bytes of an incomplete sequence should be signalled as a single
> >   malformed sequence, i.e., you should see only a single replacement
> >   character in each of the next 10 tests."
>
> Kuhn is talking about programs that display characters to users and that
> need some way to signal encoding errors. But diff is not such a program:
> it doesn't need to display a signal for an incomplete sequence, because
> it's not responsible for display.

Kuhn's writeup is generally about UTF-8 decoding. In the year 2000, when
it was written, the most important decoders were in the display engines
of terminal emulators. Nowadays, we have UTF-8 decoders in many, many
programs.

The Unicode Standard has several pages about this topic: Unicode 15.0
section 3.9
  https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
pages 124..129. It is also referenced by section 5.22
  https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf
page 255.

The relevant text starts at page 127:

  "U+FFFD Substitution of Maximal Subparts
   An increasing number of implementations are adopting the handling of
   ill-formed subsequences as specified in the W3C standard for encoding
   to achieve consistent U+FFFD replacements. ...
   Although the Unicode Standard does not require this practice for
   conformance ..."

See also table 3-11 on page 128.

So, clearly, this is not a *requirement* for a conforming UTF-8 decoder.
But the Unicode Standard's authors would not describe it at such great
length if it weren't a good practice.
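To make the difference concrete, here is a self-contained sketch of mine
(plain C, not Gnulib code; the byte values are my own example, not taken
from the Standard) that decodes UTF-8 with U+FFFD substitution of maximal
subparts:

  #include <stdio.h>

  /* Decode the maximal subpart starting at s (n > 0 bytes available).
     Return the number of bytes it occupies (at least 1) and store the
     code point, or 0xFFFD for an ill-formed subsequence, in *cp.
     Simplified from Unicode 15.0 section 3.9.  */
  static int
  decode (const unsigned char *s, int n, unsigned int *cp)
  {
    unsigned int c = s[0];
    unsigned int lo = 0x80, hi = 0xBF;
    int len, i;

    if (c < 0x80) { *cp = c; return 1; }
    else if (0xC2 <= c && c <= 0xDF) { c &= 0x1F; len = 2; }
    else if (0xE0 <= c && c <= 0xEF)
      {
        if (c == 0xE0) lo = 0xA0;           /* no overlong forms */
        if (c == 0xED) hi = 0x9F;           /* no surrogates */
        c &= 0x0F; len = 3;
      }
    else if (0xF0 <= c && c <= 0xF4)
      {
        if (c == 0xF0) lo = 0x90;           /* no overlong forms */
        if (c == 0xF4) hi = 0x8F;           /* nothing above U+10FFFF */
        c &= 0x07; len = 4;
      }
    else { *cp = 0xFFFD; return 1; }        /* lone invalid byte */

    for (i = 1; i < len; i++)
      {
        if (i >= n || s[i] < lo || s[i] > hi)
          {
            /* The bytes seen so far form ONE maximal subpart, so they
               are signalled as a single U+FFFD.  */
            *cp = 0xFFFD;
            return i;
          }
        c = (c << 6) | (s[i] & 0x3F);
        lo = 0x80; hi = 0xBF;
      }
    *cp = c;
    return len;
  }

  int
  main (void)
  {
    /* "ab", a truncated U+20AC (it needs three bytes), a newline.  */
    static const unsigned char input[] = "ab\xE2\x82\n";
    int n = sizeof input - 1;
    int i = 0;
    while (i < n)
      {
        unsigned int cp;
        i += decode (input + i, n - i, &cp);
        printf ("U+%04X\n", cp);
      }
    return 0;
  }

This prints U+0061 U+0062 U+FFFD U+000A: a single replacement character
for the two bytes \xE2\x82. Byte-by-byte reporting, as mbcel does it,
would produce two errors there instead of one.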
> It's certainly not
> typical practice in the GNU/Linux world. It's not true of the first five
> applications that I tested on Ubuntu 23.04: Emacs, Chrome, Firefox,
> less, and gnome-terminal.

Well, then I'll have to write a couple of QoI (quality of implementation)
reports... (xterm does it right, but you are right that nowadays
gnome-terminal and other vte-based terminal emulators are the majority.)

> Even if Kuhn's suggestion were good for display programs, programs like
> diff should not treat differing encoding error byte sequences as if they
> were equivalent. If two files A and B contain different encoding errors
> I expect most users would prefer "diff A B" to report the differences.

Sure. If you were to use the 'mbiterf' module instead of mbcel, the
mb_equal macro from mbchar.h does the right thing (see the sketch in the
P.S. below). Yes, an mb_equal call is a bit more complicated than the
same_ch_err definition that you have in diffutils/src/io.c. That's the
unavoidable consequence of treating a sequence of 2 or 3 bytes as *one*
error.

> There's not
> even a standard column width for U+FFFD itself: Kuhn recommends 1 in
> <https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c>, but 2 is more common in
> my experience.

Indeed, width considerations of strings with control characters are
hairy. And U+FFFD counts as a control character, according to
wcwidth(0xFFFD) == -1 (on glibc systems).

> > This may be acceptable as a corner case for 'diff'. But for a module
> > offered by Gnulib, we should IMO continue to follow the best practice
> > here.
>
> Although Kuhn's suggestion may be best practice for some applications,
> it's not best for applications like diff, and it would be helpful if
> Gnulib could support these applications.

According to what I read in the Unicode Standard (above), it's a best
practice for all kinds of applications.

I'm not asking you to rewrite the new code that you added in 'diff'. But
for other programs, from 'ls' to 'sed', I continue to think it would be a
good idea to follow that best practice.

Bruno
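P.S. To illustrate the mb_equal point: a multibyte-aware comparison loop
looks roughly like this. It is a sketch from memory, using the macros of
the classic 'mbiter' module (mbiterf's are analogous); check mbiter.h and
mbchar.h for the exact signatures before relying on it.

  #include <stdbool.h>
  #include <stddef.h>
  #include "mbiter.h"   /* Gnulib */
  #include "mbchar.h"   /* Gnulib */

  /* Compare two strings multibyte-character-wise, so that an ill-formed
     sequence is treated as one unit on each side, instead of being
     compared byte by byte.  */
  static bool
  mb_streq (const char *s1, size_t n1, const char *s2, size_t n2)
  {
    mbi_iterator_t i1, i2;
    mbi_init (i1, s1, n1);
    mbi_init (i2, s2, n2);
    while (mbi_avail (i1) && mbi_avail (i2))
      {
        if (!mb_equal (mbi_cur (i1), mbi_cur (i2)))
          return false;
        mbi_advance (i1);
        mbi_advance (i2);
      }
    /* Equal only if both strings are exhausted.  */
    return !mbi_avail (i1) && !mbi_avail (i2);
  }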