Re: mbcel module for Gnulib?, incomplete multibyte sequences

Paul Eggert Mon, 17 Jul 2023 16:09:47 -0700

On 2023-07-16 15:18, Bruno Haible wrote:

Paul Eggert wrote:

Although I'm sure mbiter can be improved I
don't see how it could catch up to mbcel so long as it continues to
solve a harder problem than mbcel solves.


I don't know exactly what you mean by "harder problem".

I meant that it solves a harder porting problem because it worries about
more issues, e.g., it worries about mbrtoc32 returning (size_t) -3, or
returning (size_t) -1 in the C locale. I guess it also worries about
column counting (something I hadn't thought about but your email raised
the issue). There are probably other things that it worries about, that
mbcel does not. The more things mbiter needs to worry about the slower
it gets.
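
To make those extra cases concrete, here is a minimal sketch of the
return values that a fully general caller of mbrtoc32 has to tell apart.
It is not code from mbiter or mbcel; the enum and the function name
classify_step are hypothetical, invented only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical result kinds, named here only for illustration.  */
enum step_kind
{
  STEP_CHAR,        /* a complete character was decoded */
  STEP_NUL,         /* the null character was decoded */
  STEP_ERROR,       /* (size_t) -1: encoding error */
  STEP_INCOMPLETE,  /* (size_t) -2: bytes are a prefix of a character */
  STEP_NO_INPUT     /* (size_t) -3: a character was stored, no bytes consumed */
};

/* Call mbrtoc32 once on the N bytes at S, using conversion state *PS,
   and classify the result.  Any decoded character is stored in *PC.
   This sketch does not handle the special meanings of a null S or PS.  */
static enum step_kind
classify_step (char const *s, size_t n, mbstate_t *ps, char32_t *pc)
{
  assert (s != NULL && ps != NULL);
  size_t r = mbrtoc32 (pc, s, n, ps);
  if (r == (size_t) -1)
    return STEP_ERROR;
  if (r == (size_t) -2)
    return STEP_INCOMPLETE;
  if (r == (size_t) -3)
    return STEP_NO_INPUT;
  if (r == 0)
    return STEP_NUL;
  return STEP_CHAR;
}
```

A scanner that can rule out some of these cases on its target platforms
has fewer branches per character, which is the speed argument above.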

The other significant difference that I see is the handling of multibyte
sequences. When there are 2 or 3 bytes (of, say, UTF-8) that constitute an
incomplete multibyte character at the end of the string,

This isn't a problem for programs like grep and diff, where there's
always a newline at the end of the input buffer.

   - mbcel returns each byte, one by one, as a character without a
     char32_t code.


(A nit: it's not a character; it's an encoding error.)

   - ISO 10646 says ([1] section R.7) "it shall interpret that malformed
     sequence in the same way that it interprets a character that is outside
     the adopted subset".

If I understand this requirement correctly mbcel satisfies it, as mbcel
treats those two things in the same way, namely, as sequences of
encoding error bytes.

   - Markus Kuhn's example ([2] section 3) has a section where
       "All bytes of an incomplete sequence should be signalled as a single
        malformed sequence, i.e., you should see only a single replacement
        character in each of the next 10 tests."

Kuhn is talking about programs that display characters to users and that
need some way to signal encoding errors. But diff is not such a program:
it doesn't need to display a signal for an incomplete sequence, because
it's not responsible for display.

Even for the class of programs that Kuhn is talking about it's not clear
that the practice he recommends is a good one. It's certainly not
typical practice in the GNU/Linux world. It's not true of the first five
applications that I tested on Ubuntu 23.04: Emacs, Chrome, Firefox,
less, and gnome-terminal.

Even if Kuhn's suggestion were good for display programs, programs like
diff should not treat differing encoding error byte sequences as if they
were equivalent. If two files A and B contain different encoding errors
I expect most users would prefer "diff A B" to report the differences.

I take the point that diff's column counting disagrees with Kuhn's
suggestion. However, there's no standard for columnar display of
encoding errors. Some programs display each encoding error byte as a
single-column character. Some do it as a two-column character. Emacs by
default uses four columns. xterm, the program that you mention, is
glitchy: sometimes it displays a UTF-8-like sequence as a single-column
U+FFFD REPLACEMENT CHARACTER but sometimes it doesn't, and on my
platform, when I cat Kuhn's test to standard output, two of the four
tests in the last screenful fail to line up their columns. There's not
even a standard column width for U+FFFD itself: Kuhn recommends 1 in
<https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c>, but 2 is more common in
my experience.

In short, in practice there's no way for diff to tell how encoding
errors are displayed. diff currently guesses 1 column per encoding error
byte, as that's an easy guess. It's not clear that complicating this
guess would be a net win for diff users. Which means mbcel is good
enough for diff.
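
As an illustration of that 1-column guess, here is a minimal sketch; it
is not diff's actual code. It assumes UTF-8 input, assumes every valid
character other than tab occupies one column (real code would consult
the system's width data), and uses a hand-written decoder. The names
utf8_char_len and count_columns are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return the length (1..4) of the valid UTF-8 character at the start
   of the N bytes at S, or 0 if S starts with an encoding error
   (including a sequence truncated by the end of the buffer).  */
static size_t
utf8_char_len (unsigned char const *s, size_t n)
{
  if (n == 0)
    return 0;
  unsigned char b = s[0];
  if (b < 0x80)
    return 1;

  size_t len;
  unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of 2nd byte */
  if (0xC2 <= b && b <= 0xDF)
    len = 2;
  else if (0xE0 <= b && b <= 0xEF)
    {
      len = 3;
      if (b == 0xE0) lo = 0xA0;         /* reject overlong forms */
      if (b == 0xED) hi = 0x9F;         /* reject surrogates */
    }
  else if (0xF0 <= b && b <= 0xF4)
    {
      len = 4;
      if (b == 0xF0) lo = 0x90;         /* reject overlong forms */
      if (b == 0xF4) hi = 0x8F;         /* reject values above U+10FFFF */
    }
  else
    return 0;

  if (n < len || s[1] < lo || hi < s[1])
    return 0;
  for (size_t i = 2; i < len; i++)
    if (s[i] < 0x80 || 0xBF < s[i])
      return 0;
  return len;
}

/* Count display columns for the N bytes at BUF with tab stops every
   TABSIZE columns.  Each encoding error byte counts as one column.  */
static size_t
count_columns (char const *buf, size_t n, size_t tabsize)
{
  assert (tabsize > 0);
  unsigned char const *s = (unsigned char const *) buf;
  size_t col = 0;
  for (size_t i = 0; i < n; )
    {
      size_t len = utf8_char_len (s + i, n - i);
      if (len == 0)
        {
          col++;                        /* one column per error byte */
          i++;
        }
      else if (s[i] == '\t')
        {
          col += tabsize - col % tabsize;
          i++;
        }
      else
        {
          col++;                        /* assumed width 1 */
          i += len;
        }
    }
  return col;
}
```

Under these assumptions the truncated sequence "a\xE2\x82" counts as 3
columns, one per byte, whereas a display following Kuhn's suggestion
would show 'a' followed by a single replacement character.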

(Composing this email prompted me to document this issue better in the
diffutils manual, so I installed the attached patch there.)

This may be acceptable as a corner case for 'diff'. But for a module offered
by Gnulib, we should IMO continue to follow the best practice here.

Although Kuhn's suggestion may be best practice for some applications,
it's not best for applications like diff, and it would be helpful if
Gnulib could support these applications.

From 856f72409b62d9d3459270971dbca9d155559e31 Mon Sep 17 00:00:00 2001
From: Paul Eggert <[email protected]>
Date: Mon, 17 Jul 2023 12:48:30 -0700
Subject: [PATCH] doc: document tab behavior better

* doc/diffutils.texi (Tabs): Document issues with tabs,
encoding errors, and non-ASCII characters.
---
 doc/diffutils.texi | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/doc/diffutils.texi b/doc/diffutils.texi
index 6f2657f..c2d42bd 100644
--- a/doc/diffutils.texi
+++ b/doc/diffutils.texi
@@ -1904,6 +1904,10 @@ These adjustments can be applied to any output format.
 @cindex tab stop alignment
 @cindex aligning tab stops
 
+The tab character moves the cursor to the next tab stop.
+Tab stops are normally every 8 display columns;
+this can be altered by the @option{--tabsize=@var{columns}} option.
+
 The lines of text in some of the @command{diff} output formats are
 preceded by one or two characters that indicate whether the text is
 inserted, deleted, or changed.  The addition of those characters can
@@ -1916,9 +1920,7 @@ number of spaces before outputting them; select this method with the
 @option{--expand-tabs} (@option{-t}) option.  To use this form of output with
 @command{patch}, you must give @command{patch} the @option{-l} or
 @option{--ignore-white-space} option (@pxref{Changed White Space}, for more
-information).  @command{diff} normally assumes that tab stops are set
-every 8 print columns, but this can be altered by the
-@option{--tabsize=@var{columns}} option.
+information).
 
 The other method for making tabs line up correctly is to add a tab
 character instead of a space after the indicator character at the
@@ -1931,6 +1933,15 @@ output format, which does not have a space character after the change
 type indicator character.  Select this method with the @option{-T} or
 @option{--initial-tab} option.
 
+GNU @command{diff} currently assumes that the output device respects tab stops,
+displays each character with column width as given by the operating system,
+and displays each encoding error byte in a single column.
+Unfortunately these assumptions are often incorrect
+for encoding errors and non-ASCII characters,
+so complex input data may not line up properly on output,
+and analysis based on the @option{--ignore-tab-expansion} (@option{-E}) option
+may differ from the display device's behavior.
+
 @node Trailing Blanks
 @section Omitting trailing blanks
 @cindex trailing blanks
-- 
2.39.2
