> From: Bruno Haible <br...@clisp.org>
> Cc: bug-texinfo@gnu.org
> Date: Mon, 09 Oct 2023 18:15:05 +0200
>
> Eli Zaretskii wrote:
> > unless the locale's codeset is UTF-8, any character that is not
> > printable _in_the_current_locale_ will return -1 from wcwidth. I'm
> > guessing that no one has ever tried to run the test suite in a
> > non-UTF-8 locale before?
>
> I just tried it now: On Linux (Ubuntu 22.04), in a de_DE.UTF-8 locale,
> texinfo 7.0.93 build fine and all tests pass.

de_DE.UTF-8 is a UTF-8 locale. I asked about non-UTF-8 locales. An
example would be de_DE.ISO8859-1. Or what am I missing?

> > Yes, quite a few characters return -1 from wcwidth, in particular the
> > ȷ character above (which explains the above difference).
>
> This character is U+0237 LATIN SMALL LETTER DOTLESS J. It *should* be
> recognized as having a width of 1 in all implementations of wcwidth.

But if U+0237 cannot be represented in the locale's codeset, its width
can not be 1, because it cannot be printed. This is my interpretation
of the standard's language (emphasis mine):

  DESCRIPTION

    The wcwidth() function shall determine the number of column
    positions required for the wide character wc. The application
    shall ensure that the value of wc is a character representable
    as a wchar_t, and is a wide-character code corresponding to a
    valid character in the current locale.
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  RETURN VALUE

    The wcwidth() function shall either return 0 (if wc is a null
    wide-character code), or return the number of column positions
    to be occupied by the wide-character code wc, or return -1 (if wc
    does not correspond to a printable wide-character code).
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Since U+0237 is not printable in my locale (it isn't supported by the
system codepage), the value -1 is correct. Am I missing something?

> There's no reason for it to have a width of -1, since it's not a control
> character.
> There's no reason for it to have a width of 0, since it's not a combining
> mark or a non-spacing character.
> There's no reason for it to have a width of 2, since it's not a CJK character
> and not in a Unicode range with many CJK characters.

I think you assume that all the Unicode letter characters are always
printable in every locale. That's not what I understand, and iswprint
agrees with me, because I get -1 for U+0237 due to this code:

> >    return wc == 0 ? 0 : iswprint (wc) ? 1 : -1;
> >
> > I don't think the above logic in Gnulib's wcwidth (which basically
> > replicates the logic in any reasonable wcwidth implementation, so is
> > not specific to Gnulib) fits what Texinfo needs. Texinfo needs to be
> > able to produce output independently of the locale. What matters to
> > Texinfo is the encoding of the output document, not the locale's
> > codeset. So I think we should call uc_width when the output document
> > encoding is UTF-8 (which is the default, including in the above test),
> > regardless of the locale's codeset. Or we could use a simpler
> > approximation:
> >
> >    return wc == 0 ? 0 : iswcntrl (wc) ? 0 : 1;
>
> This "simpler approximation" would not return a good result when wc
> is a control character (such as CR, LF, TAB, or such). It is important
> that the caller of wcwidth() or wcswidth() is able to recognize that
> the string as a whole does not have a definite width.

It is still better than returning -1, don't you agree?

But for some reason you completely ignored my more general comment
about what Texinfo needs from wcwidth.
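
To make the disagreement about control characters concrete, here is a
minimal sketch in C of the two candidate rules. approx_width is the
approximation quoted above. approx_width_strict is a hypothetical
variant (neither the name nor the function exists in Texinfo or Gnulib)
that returns -1 for control characters, so that a caller can still
detect a string without a definite width. Both assume that every other
character occupies one column, which ignores combining marks and
double-width CJK characters; uc_width would be needed for those.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* The "simpler approximation" from the discussion: it never consults
   iswprint, so a letter such as U+0237 is not rejected merely because
   the locale's codeset cannot represent it.  Control characters are
   counted as zero columns.  */
int
approx_width (wint_t wc)
{
  return wc == 0 ? 0 : iswcntrl (wc) ? 0 : 1;
}

/* Hypothetical variant (an assumption for illustration only): report
   control characters such as CR, LF and TAB as -1, so the caller can
   tell that the string as a whole has no definite width.  */
int
approx_width_strict (wint_t wc)
{
  return wc == 0 ? 0 : iswcntrl (wc) ? -1 : 1;
}
```

With either function, U+0237 gets a width of 1 as long as iswcntrl
does not classify it as a control character. The two differ only in
what they report for control characters.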