> From: Gavin Smith <gavinsmith0...@gmail.com>
> Date: Sun, 8 Oct 2023 20:21:44 +0100
> Cc: bug-texinfo@gnu.org
>
> Just comparing the first line in the hunk:
>
> -(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ
> +(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ (ȷ)
>
> the line you are getting is longer than the reference results.
>
> I wonder if for some of the non-ASCII characters wcwidth is returning 0 or
> -1 leading the line to be longer.
Yes, quite a few characters return -1 from wcwidth, in particular the
ȷ character above (which explains the above difference).

> It's also possible that other codepoints have inconsistent wcwidth results,
> especially for combining accents.
>
> Do you know if it is the gnulib implementation of wcwidth that is being
> used or a MinGW one?

AFAIK, MinGW doesn't have wcwidth, so we are using the one from
Gnulib.  But what Gnulib does in this case is not what Texinfo
expects, I think:

  int
  wcwidth (wchar_t wc)
  #undef wcwidth
  {
    /* In UTF-8 locales, use a Unicode aware width function.  */
    if (is_locale_utf8_cached ())
      {
        /* We assume that in a UTF-8 locale, a wide character is the same as a
           Unicode character.  */
        return uc_width (wc, "UTF-8");
      }
    else
      {
        /* Otherwise, fall back to the system's wcwidth function.  */
  #if HAVE_WCWIDTH
        return wcwidth (wc);
  #else
        return wc == 0 ? 0 : iswprint (wc) ? 1 : -1;
  #endif
      }
  }

IOW, unless the locale's codeset is UTF-8, any character that is not
printable _in_the_current_locale_ will return -1 from wcwidth.  I'm
guessing that no one has ever tried to run the test suite in a
non-UTF-8 locale before?

I don't think the above logic in Gnulib's wcwidth (which basically
replicates the logic in any reasonable wcwidth implementation, so is
not specific to Gnulib) fits what Texinfo needs.  Texinfo needs to be
able to produce output independently of the locale.  What matters to
Texinfo is the encoding of the output document, not the locale's
codeset.  So I think we should call uc_width when the output document
encoding is UTF-8 (which is the default, including in the above test),
regardless of the locale's codeset.  Or we could use a simpler
approximation:

  return wc == 0 ? 0 : iswcntrl (wc) ? 0 : 1;

CC'ing Bruno who I think knows much more about this.
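
For illustration only, here is a minimal self-contained sketch of the
simpler approximation above.  The function name approx_width is
hypothetical and is not part of Texinfo or Gnulib.  Note that iswcntrl
still consults the current locale's LC_CTYPE, so this is not fully
locale-independent; the point of the sketch is only that it never
returns -1, so a character such as ȷ (U+0237) is counted as one column
even when the locale cannot print it.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: approximate the column width of WC.
   NUL and control characters count as 0 columns, everything
   else as 1.  Unlike the iswprint-based fallback, this never
   returns -1 for characters the locale considers unprintable.
   It does not handle double-width or combining characters.  */
static int
approx_width (wchar_t wc)
{
  return wc == 0 ? 0 : iswcntrl ((wint_t) wc) ? 0 : 1;
}
```

With this, approx_width (L'a') and approx_width (0x0237) both give 1,
while NUL and newline give 0.  The trade-off is that East Asian wide
characters and combining accents would be miscounted, which uc_width
handles properly.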