Hi,

> > 15 | e: {∇⃗×𝐸⃗ = -∂𝐵⃗/∂t}
> > - | ^~~~~~~~~~~~~~
> > + | ^~~~~~~~~~~~~~~~~

Indeed, mbswidth seems to have returned 3 more columns.

> The error (three more columns than expected) seems to indicate something
> related to the combining arrow.

No. The issue comes from the math symbols. The following test program
shows it:

#include <config.h>
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
#include "mbswidth.h"
int main ()
{
  /* Trailing comments: expected value vs. value seen on the affected system. */
  setlocale (LC_ALL, "en_US.UTF-8");
  printf ("%d\n", (int) mbswidth ("{∇⃗×𝐸⃗ = -∂𝐵⃗/∂t}",0)); // 14 vs 17
  printf ("%d\n", wcwidth (0x2207)); // 1 vs. 2
  printf ("%d\n", wcwidth (0x20D7)); // 0
  printf ("%d\n", wcwidth (0x00D7)); // 1
  printf ("%d\n", wcwidth (0x1D438)); // 1
  printf ("%d\n", wcwidth (0x2202)); // 1 vs. 2
  printf ("%d\n", wcwidth (0x1D435)); // 1
}

The following patch should fix it.

The patch changes the behaviour of wcwidth(0x2202) for UTF-8 locales.
It would be possible to limit the change to the non-East-Asian UTF-8
locales (by using the function uc_locale_language() and testing whether
its result is not one of "zh", "ja", "ko"), but glibc does not do this
(it uses the same width across all UTF-8 locales), therefore I'm not
doing it here either.


2019-05-05  Bruno Haible  <br...@clisp.org>

        wcwidth: Ensure width 1, not 2, for ambiguous characters.
        Reported by Kiyoshi KANAZAWA <yoi_no_myou...@yahoo.co.jp> via
        Akim Demaille <akim.demai...@gmail.com>.
        * m4/wcwidth.m4 (gl_FUNC_WCWIDTH): Check the width of U+2202. Use an
        en_US.UTF-8 locale, since that is more likely to be present than an
        fr_FR.UTF-8 locale.
        * tests/test-wcwidth.c (main): Check the width of U+2202.
        * doc/posix-functions/wcwidth.texi: Mention the issue.

diff --git a/m4/wcwidth.m4 b/m4/wcwidth.m4
index 3952fd2..e9b5bf4 100644
--- a/m4/wcwidth.m4
+++ b/m4/wcwidth.m4
@@ -1,4 +1,4 @@
-# wcwidth.m4 serial 28
+# wcwidth.m4 serial 29
 dnl Copyright (C) 2006-2019 Free Software Foundation, Inc.
 dnl This file is free software; the Free Software Foundation
 dnl gives unlimited permission to copy and/or distribute it,
@@ -54,6 +54,8 @@ AC_DEFUN([gl_FUNC_WCWIDTH],
         dnl On OSF/1 5.1, wcwidth(0x200B) (ZERO WIDTH SPACE) returns 1.
         dnl On OpenBSD 5.8, wcwidth(0xFF1A) (FULLWIDTH COLON) returns 0.
         dnl This leads to bugs in 'ls' (coreutils).
+        dnl On Solaris 11.4, wcwidth(0x2202) (PARTIAL DIFFERENTIAL) returns 2,
+        dnl even in Western locales.
         AC_CACHE_CHECK([whether wcwidth works reasonably in UTF-8 locales],
           [gl_cv_func_wcwidth_works],
           [
@@ -80,7 +82,7 @@ int wcwidth (int);
 int main ()
 {
   int result = 0;
-  if (setlocale (LC_ALL, "fr_FR.UTF-8") != NULL)
+  if (setlocale (LC_ALL, "en_US.UTF-8") != NULL)
     {
       if (wcwidth (0x0301) > 0)
         result |= 1;
@@ -90,6 +92,8 @@ int main ()
         result |= 4;
       if (wcwidth (0xFF1A) == 0)
         result |= 8;
+      if (wcwidth (0x2202) > 1)
+        result |= 16;
     }
   return result;
 }]])],
diff --git a/tests/test-wcwidth.c b/tests/test-wcwidth.c
index eb7bdd2..8e9cea3 100644
--- a/tests/test-wcwidth.c
+++ b/tests/test-wcwidth.c
@@ -72,6 +72,22 @@ main ()
       ASSERT (wcwidth (0x200B) == 0);
       ASSERT (wcwidth (0xFEFF) <= 0);
 
+      /* Test width of some math symbols.
+         U+2202 is marked as having ambiguous width (A) in EastAsianWidth.txt
+         (see <https://www.unicode.org/Public/12.0.0/ucd/EastAsianWidth.txt>).
+         The Unicode Standard Annex 11
+         <https://www.unicode.org/reports/tr11/tr11-36.html>
+         says
+           "Ambiguous characters behave like wide or narrow characters
+            depending on the context (language tag, script identification,
+            associated font, source of data, or explicit markup; all can
+            provide the context). If the context cannot be established
+            reliably, they should be treated as narrow characters by default."
+         For wcwidth(), the only available context information is the locale.
+         "fr_FR.UTF-8" is a Western locale, not an East Asian locale, therefore
+         U+2202 should be treated like a narrow character. */
+      ASSERT (wcwidth (0x2202) == 1);
+
       /* Test width of some CJK characters. */
       ASSERT (wcwidth (0x3000) == 2);
       ASSERT (wcwidth (0xB250) == 2);
diff --git a/doc/posix-functions/wcwidth.texi b/doc/posix-functions/wcwidth.texi
index 741be8e..ecdf758 100644
--- a/doc/posix-functions/wcwidth.texi
+++ b/doc/posix-functions/wcwidth.texi
@@ -18,6 +18,10 @@ glibc 2.8.
 This function handles combining characters in UTF-8 locales incorrectly on
 some platforms:
 Mac OS X 10.3, OpenBSD 5.8.
+@item
+This function returns 2 for characters with ambiguous east asian width, even in
+Western locales, on some platforms:
+Solaris 11.4.
 @end itemize
 
 Portability problems not fixed by Gnulib: