On 2023-07-02 13:18, Bruno Haible wrote:
>> If (size_t) -3 is possible, I suppose I should change diffutils to take
>> this into account, as bleeding-edge diffutils/src/side.c treats (size_t)
>> -3 as meaning the next input byte is an encoding error, which is
>> obviously wrong.
> If you want the diffutils code to be future-proof, yes.

Given your explanation, it doesn't sound like it's worth the effort. Having to worry about the only-theoretical possibility of mbrtoc32 disgorging wide characters without consuming input bytes would significantly complicate diffutils, something I'm loath to do if this never occurs in practice.


>> The simplest way to fix this would be for diffutils to
>> go back to using wchar_t,
> ?? We are talking about 2 lines of code

Not for diffutils we aren't. If I understand things correctly, diffutils would have to look ahead to the next mbrtoc32 call that returns a nonnegative value before deciding what to do about the previous N calls where the first returned a positive value and the remaining calls returned (size_t) -3. This sort of lookahead would be doable but painful with significant performance implications.

The complication would be needed because diffutils is trying to count columns as it goes, and in some cases it needs to stop when a column count has reached a maximum. It's not two lines of code. I'd guess we're talking closer to two hundred lines scattered about three source files (OK, four, since I should create a new .h file to handle the mess). And performance would possibly degrade significantly for typical usage, unless I did something like an "#ifdef THEORETICAL_PLATFORM".
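To make the column-counting concern concrete, here is a minimal sketch of a loop that counts columns as it goes and stops at a maximum. It is not the diffutils code: bytes_within_columns is a hypothetical helper, and it assumes every character (and every invalid byte) is exactly one column wide, which real code would replace with a proper width lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper, not taken from diffutils: return how many bytes
   of S[0..LEN) fit within MAX_COLS columns.  Simplifying assumption:
   each character and each invalid byte occupies one column.  */
size_t
bytes_within_columns (const char *s, size_t len, size_t max_cols)
{
  assert (s != NULL || len == 0);

  mbstate_t state;
  memset (&state, 0, sizeof state);   /* start in the initial state */
  size_t pos = 0;                     /* bytes consumed so far */
  size_t cols = 0;                    /* columns counted so far */

  while (pos < len && cols < max_cols)
    {
      char32_t ch;
      size_t n = mbrtoc32 (&ch, s + pos, len - pos, &state);

      if (n == (size_t) -3)
        {
          /* A further char32_t was produced from bytes that an earlier
             call already consumed, so POS must not advance.  */
          cols++;
          continue;
        }
      if (n == (size_t) -2)
        {
          /* The remaining bytes form an incomplete character;
             count them as one column and stop.  */
          cols++;
          pos = len;
          break;
        }
      if (n == (size_t) -1)
        {
          /* Encoding error: skip one byte and reset the state.  */
          memset (&state, 0, sizeof state);
          pos++;
          cols++;
          continue;
        }
      if (n == 0)
        n = 1;                        /* a null character is one byte */
      pos += n;
      cols++;
    }
  return pos;
}
```

Note what the sketch does not do: if the column limit is reached between a call that returned a positive value and the (size_t) -3 calls that follow it, the bytes have already been counted as fitting, and any (size_t) -3 results still pending after the last input byte are never fetched. Getting those cases right is where the lookahead would come in.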


> With plain wchar_t (as opposed to char32_t), character classes and print widths
> of non-BMP characters come out wrong on Cygwin, native Windows, and 32-bit AIX.

Thanks, that was what I was guessing and it's helpful to have it confirmed. As long as switching to char32_t doesn't significantly hurt the GNU platform cases, it's worth it to stick with char32_t.
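For readers wondering why non-BMP characters are the ones that go wrong, here is a small sketch. It assumes the platforms listed have a 16-bit wchar_t holding UTF-16 code units (my assumption, not something stated above), and utf16_encode is a hypothetical helper. A code point above U+FFFF then needs two units, so no single wchar_t value can be classified or measured as that character.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode code point CP as UTF-16 into OUT and
   return the number of 16-bit units used (1 or 2).  */
int
utf16_encode (uint_least32_t cp, uint_least16_t out[2])
{
  /* CP must be a valid scalar value: in range and not a surrogate.  */
  assert (cp <= 0x10FFFF && !(0xD800 <= cp && cp <= 0xDFFF));

  if (cp < 0x10000)
    {
      out[0] = (uint_least16_t) cp;   /* BMP: fits in one unit */
      return 1;
    }

  cp -= 0x10000;                      /* 20 bits remain */
  out[0] = (uint_least16_t) (0xD800 + (cp >> 10));    /* high surrogate */
  out[1] = (uint_least16_t) (0xDC00 + (cp & 0x3FF));  /* low surrogate */
  return 2;
}
```

For example, U+0041 takes one unit, while U+1F600 becomes the pair 0xD83D 0xDE00. A char32_t holds either code point in one value.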

I had been worried that one of these platforms would return (size_t) -3 and in that case I supposed we would need to switch diffutils back to wchar_t for portability to these platforms without worrying about -3. I'm glad to hear this is not the case.
