On 2023-07-16 15:18, Bruno Haible wrote:
Paul Eggert wrote:
Although I'm sure mbiter can be improved I
don't see how it could catch up to mbcel so long as it continues to
solve a harder problem than mbcel solves.

I don't know exactly what you mean by "harder problem".

I meant that it solves a harder porting problem because it worries about more issues, e.g., it worries about mbrtoc32 returning (size_t) -3, or returning (size_t) -1 in the C locale. I guess it also worries about column counting (something I hadn't thought about but your email raised the issue). There are probably other things that it worries about, that mbcel does not. The more things mbiter needs to worry about the slower it gets.
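
For readers unfamiliar with the special return values mentioned above, here is a minimal sketch of how a caller might tell them apart. The enum and function names are hypothetical and are not taken from mbiter or mbcel; only the return convention of mbrtoc32 itself comes from the C standard.

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical classification of mbrtoc32's return value.
   These names are for illustration only.  */
enum mb_result
{
  MB_NUL,               /* 0: the null character was decoded */
  MB_CHAR,              /* 1..n: a character occupying that many bytes */
  MB_ENCODING_ERROR,    /* (size_t) -1: invalid byte sequence */
  MB_INCOMPLETE,        /* (size_t) -2: valid so far, but incomplete */
  MB_NO_INPUT_CONSUMED  /* (size_t) -3: a character was stored from
                           earlier input; no bytes were consumed */
};

static enum mb_result
classify_mbrtoc32 (size_t ret)
{
  if (ret == (size_t) -1)
    return MB_ENCODING_ERROR;
  if (ret == (size_t) -2)
    return MB_INCOMPLETE;
  if (ret == (size_t) -3)
    return MB_NO_INPUT_CONSUMED;
  if (ret == 0)
    return MB_NUL;
  return MB_CHAR;
}
```

Each extra case is one more branch that a fully general iterator has to handle on every step.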


The other significant difference that I see is the handling of multibyte
sequences. When there are 2 or 3 bytes (of, say, UTF-8) that constitute an
incomplete multibyte character at the end of the string,

This isn't a problem for programs like grep and diff, where there's always a newline at the end of the input buffer.


   - mbcel returns each byte, one by one, as a character without a
     char32_t code.

(A nit: it's not a character; it's an encoding error.)


   - ISO 10646 says ([1] section R.7) "it shall interpret that malformed
     sequence in the same way that it interprets a character that is outside
     the adopted subset".

If I understand this requirement correctly mbcel satisfies it, as mbcel treats those two things in the same way, namely, as sequences of encoding error bytes.
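
As a rough illustration of the "one encoding error per byte" approach, here is a minimal sketch. It is not the actual mbcel code: the struct and function names are hypothetical, and it assumes the caller guarantees p < lim and has already selected the locale.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical result type: either a character or a single
   encoding error byte.  Not the real mbcel data structure.  */
struct unit
{
  char32_t ch;         /* character code, valid if !err */
  unsigned char len;   /* bytes consumed, always >= 1 */
  bool err;            /* true if this unit is an encoding error byte */
  unsigned char byte;  /* the offending byte, valid if err */
};

/* Scan one unit from the bytes in [P, LIM).  Requires P < LIM.
   Invalid and incomplete sequences are both reported one byte
   at a time, so no bytes are silently merged or dropped.  */
static struct unit
scan_unit (const char *p, const char *lim)
{
  mbstate_t st;
  memset (&st, 0, sizeof st);   /* start in the initial shift state */

  char32_t ch;
  size_t n = mbrtoc32 (&ch, p, (size_t) (lim - p), &st);

  if (n == 0)   /* null character: it occupies one byte */
    return (struct unit) { .ch = 0, .len = 1 };
  if (n <= (size_t) (lim - p))
    return (struct unit) { .ch = ch, .len = (unsigned char) n };

  /* (size_t) -1, -2, or -3: treat the first byte as an error.  */
  return (struct unit) { .len = 1, .err = true,
                         .byte = (unsigned char) *p };
}
```

Because each error unit keeps its byte value, two inputs with different malformed bytes still compare as different.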


   - Markus Kuhn's example ([2] section 3) has a section where
       "All bytes of an incomplete sequence should be signalled as a single
        malformed sequence, i.e., you should see only a single replacement
        character in each of the next 10 tests."

Kuhn is talking about programs that display characters to users and that need some way to signal encoding errors. But diff is not such a program: it doesn't need to display a signal for an incomplete sequence, because it's not responsible for display.

Even for the class of programs that Kuhn is talking about it's not clear that the practice he recommends is a good one. It's certainly not typical practice in the GNU/Linux world. It's not true of the first five applications that I tested on Ubuntu 23.04: Emacs, Chrome, Firefox, less, and gnome-terminal.

Even if Kuhn's suggestion were good for display programs, programs like diff should not treat differing encoding error byte sequences as if they were equivalent. If two files A and B contain different encoding errors I expect most users would prefer "diff A B" to report the differences.

I take the point that diff's column counting disagrees with Kuhn's suggestion. However, there's no standard for columnar display of encoding errors. Some programs display each encoding error byte as a single-column character. Some do it as a two-column character. Emacs by default uses four columns. xterm, the program that you mention, is glitchy: sometimes it displays a UTF-8-like sequence as a single-column U+FFFD REPLACEMENT CHARACTER but sometimes it doesn't, and on my platform, when I cat Kuhn's test to standard output, two of the four tests in the last screenful fail to line up their columns. There's not even a standard column width for U+FFFD itself: Kuhn recommends 1 in <https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c>, but 2 is more common in my experience.

In short, in practice there's no way for diff to tell how encoding errors are displayed. diff currently guesses 1 column per encoding error byte, as that's an easy guess. It's not clear that complicating this guess would be a net win for diff users, which means mbcel is good enough for diff.
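
To make the "easy guess" concrete, here is a minimal sketch of that kind of column counting. It is not diff's actual code, and the function names are hypothetical. As a simplifying assumption, every byte other than a tab is counted as one column, which matches the guess only for ASCII text and encoding error bytes; real multibyte characters would need a width lookup. It also assumes tabsize > 0.

```c
#include <stddef.h>

/* Return the column of the next tab stop after COL, with tab
   stops every TABSIZE columns.  Requires TABSIZE > 0.  */
static size_t
next_tab_stop (size_t col, size_t tabsize)
{
  return col + tabsize - col % tabsize;
}

/* Guess how many display columns the LEN bytes at BUF occupy,
   starting from column 0.  Simplification: every non-tab byte,
   including an encoding error byte, counts as one column.  */
static size_t
guess_columns (const unsigned char *buf, size_t len, size_t tabsize)
{
  size_t col = 0;
  for (size_t i = 0; i < len; i++)
    {
      if (buf[i] == '\t')
        col = next_tab_stop (col, tabsize);
      else
        col++;
    }
  return col;
}
```

With 8-column tab stops, "a\tb" ends at column 9, and the two-byte malformed sequence "\xC3(" ends at column 2. A display that renders the error byte as U+FFFD with width 2, or as a four-column escape, would disagree with this guess.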

(Composing this email prompted me to document this issue better in the diffutils manual, so I installed the attached patch there.)


This may be acceptable as a corner case for 'diff'. But for a module offered
by Gnulib, we should IMO continue to follow the best practice here.

Although Kuhn's suggestion may be best practice for some applications, it's not best for applications like diff, and it would be helpful if Gnulib could support these applications.

From 856f72409b62d9d3459270971dbca9d155559e31 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Mon, 17 Jul 2023 12:48:30 -0700
Subject: [PATCH] doc: document tab behavior better

* doc/diffutils.texi (Tabs): Document issues with tabs,
encoding errors, and non-ASCII characters.
---
 doc/diffutils.texi | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/doc/diffutils.texi b/doc/diffutils.texi
index 6f2657f..c2d42bd 100644
--- a/doc/diffutils.texi
+++ b/doc/diffutils.texi
@@ -1904,6 +1904,10 @@ These adjustments can be applied to any output format.
 @cindex tab stop alignment
 @cindex aligning tab stops
 
+The tab character moves the cursor to the next tab stop.
+Tab stops are normally every 8 display columns;
+this can be altered by the @option{--tabsize=@var{columns}} option.
+
 The lines of text in some of the @command{diff} output formats are
 preceded by one or two characters that indicate whether the text is
 inserted, deleted, or changed.  The addition of those characters can
@@ -1916,9 +1920,7 @@ number of spaces before outputting them; select this method with the
 @option{--expand-tabs} (@option{-t}) option.  To use this form of output with
 @command{patch}, you must give @command{patch} the @option{-l} or
 @option{--ignore-white-space} option (@pxref{Changed White Space}, for more
-information).  @command{diff} normally assumes that tab stops are set
-every 8 print columns, but this can be altered by the
-@option{--tabsize=@var{columns}} option.
+information).
 
 The other method for making tabs line up correctly is to add a tab
 character instead of a space after the indicator character at the
@@ -1931,6 +1933,15 @@ output format, which does not have a space character after the change
 type indicator character.  Select this method with the @option{-T} or
 @option{--initial-tab} option.
 
+GNU @command{diff} currently assumes that the output device respects tab stops,
+displays each character with column width as given by the operating system,
+and displays each encoding error byte in a single column.
+Unfortunately these assumptions are often incorrect
+for encoding errors and non-ASCII characters,
+so complex input data may not line up properly on output,
+and analysis based on the @option{--ignore-tab-expansion} (@option{-E}) option
+may differ from the display device's behavior.
+
 @node Trailing Blanks
 @section Omitting trailing blanks
 @cindex trailing blanks
-- 
2.39.2
