On 2023-07-02 13:18, Bruno Haible wrote:
>> If (size_t) -3 is possible, I suppose I should change diffutils to take
>> this into account, as bleeding-edge diffutils/src/side.c treats (size_t)
>> -3 as meaning the next input byte is an encoding error, which is
>> obviously wrong.
>
> If you want the diffutils code to be future-proof, yes.

Given your explanation, it doesn't sound like it's worth the effort.
Having to worry about the only-theoretical possibility of mbrtoc32
disgorging wide characters without consuming input bytes would
significantly complicate diffutils, something I'm loath to do if this
never occurs in practice.

>> The simplest way to fix this would be for diffutils to
>> go back to using wchar_t,
>
> ?? We are talking about 2 lines of code

Not for diffutils we aren't. If I understand things correctly, diffutils
would have to look ahead to the next mbrtoc32 call that returns a
nonnegative value before deciding what to do about the previous N calls
where the first returned a positive value and the remaining calls
returned (size_t) -3. This sort of lookahead would be doable but painful
with significant performance implications.
The complication would be needed because diffutils is trying to count
columns as it goes, and in some cases it needs to stop when a column
count has reached a maximum. It's not two lines of code. I'd guess we're
talking closer to two hundred lines scattered about three source files
(OK, four, since I should create a new .h file to handle the mess). And
performance would possibly degrade significantly for typical usage,
unless I did something like an "#ifdef THEORETICAL_PLATFORM".
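
To make the return-value cases concrete, here is a rough sketch of a scanning loop that treats (size_t) -3 as "one more char32_t, no bytes consumed". It is not the diffutils code: scan_chars and struct scan_result are hypothetical names, every char32_t is counted as one column (real code would ask for a print width), and an invalid byte is counted as one column, which is also just an assumption.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical result type: bytes consumed and char32_t values seen.  */
struct scan_result { size_t bytes; size_t chars; };

/* Scan the first N bytes of S, stopping once MAX_CHARS char32_t values
   have been produced.  Each char32_t counts as one column here, which
   is a simplification made only for this sketch.  */
static struct scan_result
scan_chars (const char *s, size_t n, size_t max_chars)
{
  struct scan_result r = { 0, 0 };
  mbstate_t state;
  memset (&state, 0, sizeof state);

  while (r.chars < max_chars)
    {
      char32_t ch;
      size_t ret = mbrtoc32 (&ch, s + r.bytes, n - r.bytes, &state);

      if (ret == (size_t) -3)
        {
          /* Another char32_t from bytes that were already consumed.  */
          r.chars++;
          continue;
        }
      if (r.bytes == n || ret == (size_t) -2)
        break;                  /* End of input, or incomplete character.  */
      if (ret == (size_t) -1)
        {
          /* Encoding error: skip one byte and start from a fresh state.  */
          memset (&state, 0, sizeof state);
          ret = 1;
        }
      else if (ret == 0)
        ret = 1;                /* A null character occupies one byte.  */

      r.bytes += ret;
      r.chars++;
    }
  return r;
}
```

Note that the max_chars test can fire between a call that returned a positive value and the (size_t) -3 calls that follow it, so the loop can stop in the middle of such a group. Avoiding that is where the lookahead would come in.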

> With plain wchar_t (as opposed to char32_t), character classes and print widths
> of non-BMP characters come out wrong on Cygwin, native Windows, and 32-bit AIX.

Thanks, that was what I was guessing and it's helpful to have it
confirmed. As long as switching to char32_t doesn't significantly hurt
the GNU platform cases it's worth it to stick with char32_t.
I had been worried that one of these platforms would return (size_t) -3
and in that case I supposed we would need to switch diffutils back to
wchar_t for portability to these platforms without worrying about -3.
I'm glad to hear this is not the case.