Paul Eggert wrote:
> The complication would be needed because diffutils is trying to count
> columns as it goes, and in some cases it needs to stop when a column
> count has reached a maximum. It's not two lines of code.

Indeed. I need to check the mbiter and mbuiter modules, since they do
something similar...

In the big picture, we are talking about levels of perfection that may
happen in the described situation:

  Level 1: Behave incorrectly but don't crash.
           This is what code that uses mbrtowc() does. See my glibc bug
           report https://sourceware.org/bugzilla/show_bug.cgi?id=30611

  Level 2: Behave correctly, except that a 2-Unicode-character sequence
           may be split although it shouldn't.
           This is what code that uses mbrtoc32() does, when it has the
           lines
               if (bytes == (size_t) -3)
                 bytes = 0;
           Without these lines, the string pointer could be decremented
           by 3, thus accessing invalid memory or running into an endless
           loop.
           This level is also what
               printf ("%.*s", nbytes, string);
           does: it truncates strings at a position where they should not
           be truncated. So, it's not terribly uncommon.

  Level 3: Behave correctly. Don't split a 2-Unicode-character sequence.
           This is what code that uses mbrtoc32() does, when it has the
           lines
               if (bytes == (size_t) -3)
                 bytes = 0;
           and uses !mbsinit (&state) in the loop termination condition.

> > ?? We are talking about 2 lines of code
>
> Not for diffutils we aren't. If I understand things correctly, diffutils
> would have to look ahead to the next mbrtoc32 call that returns a
> nonnegative value before deciding what to do about the previous N calls
> where the first returned a positive value and the remaining calls
> returned (size_t) -3. This sort of lookahead would be doable but painful
> with significant performance implications.

You're right that more than 2 lines of code are needed. But I think, with
the help of an mbsinit (&state) test, the added code and performance
implications can be kept small.

> Given your explanation, it doesn't sound like it's worth the effort.

I agree, and I explained in the glibc bug report that I would like the
zh_HK.BIG5-HKSCS locale to go away.

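To make level 3 concrete, here is a minimal sketch of such a loop. It is
not the diffutils code. The function name prefix_length is hypothetical,
and the sketch assumes that every char32_t counts as one column (real code
would use a width function) and that the encoding is stateless, so that
mbsinit (&state) is false only while output from a (size_t) -3 sequence
is still pending.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper, for illustration only.
   Return the length in bytes of the longest prefix of S[0..N) that
   converts to at most MAX_CHARS char32_t values and does not end in
   the middle of a multi-character sequence.  */
size_t
prefix_length (const char *s, size_t n, size_t max_chars)
{
  mbstate_t state;
  size_t pos = 0;    /* bytes consumed so far */
  size_t kept = 0;   /* end of the last complete sequence that fit */
  size_t count = 0;  /* char32_t values produced so far */

  assert (s != NULL);
  memset (&state, 0, sizeof state);

  /* Keep going while input remains or output is still pending.  */
  while (pos < n || !mbsinit (&state))
    {
      char32_t c;
      size_t bytes = mbrtoc32 (&c, s + pos, n - pos, &state);

      if (bytes == (size_t) -1 || bytes == (size_t) -2)
        break;                  /* invalid or incomplete input: stop */
      if (bytes == 0)
        bytes = 1;              /* a NUL character occupies one byte */
      else if (bytes == (size_t) -3)
        bytes = 0;              /* more output, no input consumed */
      pos += bytes;
      count++;

      if (count > max_chars)
        break;                  /* this sequence does not fit */
      if (mbsinit (&state))
        kept = pos;             /* safe truncation point */
    }
  return kept;
}
```

A caller would pass the result to something like
printf ("%.*s", (int) prefix_length (s, strlen (s), max), s). The
lookahead is limited to draining the pending (size_t) -3 results before
a truncation point is recorded, which is why the extra cost stays small.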

> I had been worried that one of these platforms would return (size_t) -3
> and in that case I supposed we would need to switch diffutils back to
> wchar_t for portability to these platforms without worrying about -3.
> I'm glad to hear this is not the case.

No need to worry in this direction. Cygwin and native Windows don't
support as many encodings as glibc does.

Bruno


2023-07-03  Bruno Haible  <br...@clisp.org>

	mbrtoc32: Document another glibc bug.
	* doc/posix-functions/mbrtoc32.texi: Reference the glibc bug in
	BIG5-HKSCS locales.

diff --git a/doc/posix-functions/mbrtoc32.texi b/doc/posix-functions/mbrtoc32.texi
index 93a7aa64ff..3528114bec 100644
--- a/doc/posix-functions/mbrtoc32.texi
+++ b/doc/posix-functions/mbrtoc32.texi
@@ -38,6 +38,11 @@
 Portability problems not fixed by Gnulib:
 @itemize
 @item
+This function behaves incorrectly when converting precomposed characters
+from the BIG5-HKSCS encoding:
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=30611
+glibc 2.36.
+@item
 Although ISO C says this function can return @code{(size_t) -3},
 no known implementation behaves that way, and if it were to happen
 it would break common uses.