"builtin printf '\uFF8E'" generates broken surrogate pairs in Cygwin

Koichi MURASE Sat, 05 Nov 2016 23:07:58 -0700

Hello, I would like to submit the following bashbug report.


Configuration Information [Automatically generated, do not change]:
Machine: i686
OS: cygwin
Compiler: gcc
Compilation CFLAGS:  -DPROGRAM='bash.exe' -DCONF_HOSTTYPE='i686'
-DCONF_OSTYPE='cygwin' -DCONF_MACHTYPE='i686-pc-cygwin'
-DCONF_VENDOR='pc' -DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash'
-DSHELL -DHAVE_CONFIG_H -DRECYCLES_PIDS   -I.
-I/usr/src/bash-4.3.46-7.i686/src/bash-4.3
-I/usr/src/bash-4.3.46-7.i686/src/bash-4.3/include
-I/usr/src/bash-4.3.46-7.i686/src/bash-4.3/lib  -DWORDEXP_OPTION -ggdb
-O2 -pipe -Wimplicit-function-declaration
-fdebug-prefix-map=/usr/src/bash-4.3.46-7.i686/build=/usr/src/debug/bash-4.3.46-7
-fdebug-prefix-map=/usr/src/bash-4.3.46-7.i686/src/bash-4.3=/usr/src/debug/bash-4.3.46-7
uname output: CYGWIN_NT-10.0-WOW magnate2016 2.6.0(0.304/5/3)
2016-08-31 14:27 i686 Cygwin
Machine Type: i686-pc-cygwin

Bash Version: 4.3
Patch Level: 46
Release Status: release

Description:

  I noticed that built-in commands such as "printf '\uFF8E'" generate
broken surrogate pairs in Cygwin.

Repeat-By:

  $ echo $MACHTYPE
  i686-pc-cygwin
  $ echo $LANG
  ja_JP.UTF-8
  $ printf '\uFF8E\n' # <-- U+FF8E is "halfwidth katakana Ho", a
                      #     Japanese character.
  ??                  # <-- Some unknown characters are output.
  $ /bin/printf '\uFF8E\n'
  ﾎ                   # <-- OK with /bin/printf
  $ printf '\uFF8E' | od -t x1 -A n
   ed 9f bf ed be 8e  # <-- This is the UTF-8 encoding of <U+D7FF U+DF8E>.

  Note that <U+D7FF U+DF8E> is not a valid surrogate pair. The first
element of a surrogate pair must be in the range from U+D800 to
U+DBFF, and the second must be in the range from U+DC00 to U+DFFF. In
any case, U+FF8E lies in the Basic Multilingual Plane, so it is encoded
as a single UTF-16 code unit and cannot be represented by a surrogate
pair.
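
  For illustration, the range rules above can be checked with a small
C sketch. The helper names (is_high_surrogate, is_low_surrogate,
utf16_well_formed) are hypothetical and are not part of bash; they only
encode the ranges stated in this report.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not taken from bash. */

/* First element of a surrogate pair: U+D800..U+DBFF. */
static int is_high_surrogate(uint16_t u)
{
  return u >= 0xD800 && u <= 0xDBFF;
}

/* Second element of a surrogate pair: U+DC00..U+DFFF. */
static int is_low_surrogate(uint16_t u)
{
  return u >= 0xDC00 && u <= 0xDFFF;
}

/* Return 1 if units[0..n) is well-formed UTF-16, 0 otherwise. */
static int utf16_well_formed(const uint16_t *units, size_t n)
{
  size_t i = 0;

  while (i < n)
    {
      if (is_high_surrogate(units[i]))
        {
          /* A high surrogate must be followed by a low surrogate. */
          if (i + 1 >= n || !is_low_surrogate(units[i + 1]))
            return 0;
          i += 2;
        }
      else if (is_low_surrogate(units[i]))
        return 0;               /* low surrogate without a preceding high one */
      else
        i++;                    /* ordinary BMP code unit */
    }
  return 1;
}
```

  With these definitions, the sequence { 0xD7FF, 0xDF8E } is rejected
because 0xDF8E is a low surrogate that does not follow a high
surrogate, while the single unit { 0xFF8E } is accepted.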

Fix:

  I think the function "u32toutf16 (c, s)" in lib/sh/unicode.c is
broken. Note that this function is used only on systems where "sizeof
(wchar_t) == 2"; Cygwin is one of them. I also checked the source code
of the latest version, bash-4.4 (patch level 0), and the function is
not yet fixed there. Characters in the range from U+E000 to U+FFFF
should not be encoded as surrogate pairs; they have no surrogate-pair
representation.

diff --git a/lib/sh/unicode.c b/lib/sh/unicode.c
index b58eaef..29acac6 100644
--- a/lib/sh/unicode.c
+++ b/lib/sh/unicode.c
@@ -219,12 +219,12 @@ u32toutf16 (c, s)
   int l;

   l = 0;
-  if (c < 0x0d800)
+  if (c < 0x0d800 || (c >= 0x0e000 && c <= 0x0ffff))
     {
       s[0] = (unsigned short) (c & 0xFFFF);
       l = 1;
     }
-  else if (c >= 0x0e000 && c <= 0x010ffff)
+  else if (c >= 0x10000 && c <= 0x010ffff)
     {
       c -= 0x010000;
       s[0] = (unsigned short)((c >> 10) + 0xd800);
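
  To see where <U+D7FF U+DF8E> comes from, the branch conditions before
and after the patch can be compared in a standalone sketch. This is not
the bash source: the function names are hypothetical, the use of
uint32_t for "c" is an assumption, and the low-surrogate line and the
return value are filled in from the standard UTF-16 formula because the
hunk does not show them.

```c
#include <stdint.h>

/* Hypothetical re-creation of the pre-patch conditions from the diff.
   Assumes c is an unsigned 32-bit value. */
static int old_u32toutf16_sketch(uint32_t c, uint16_t *s)
{
  int l = 0;

  if (c < 0xd800)
    {
      s[0] = (uint16_t)(c & 0xFFFF);
      l = 1;
    }
  else if (c >= 0xe000 && c <= 0x10ffff)
    {
      /* For c = 0xFF8E the subtraction wraps around to 0xFFFFFF8E. */
      c -= 0x10000;
      /* 0x3FFFFF + 0xD800 = 0x40D7FF, truncated to 0xD7FF. */
      s[0] = (uint16_t)((c >> 10) + 0xd800);
      /* 0x38E + 0xDC00 = 0xDF8E (standard formula, assumed here). */
      s[1] = (uint16_t)((c & 0x3ff) + 0xdc00);
      l = 2;
    }
  return l;
}

/* Same sketch with the patched conditions applied. */
static int new_u32toutf16_sketch(uint32_t c, uint16_t *s)
{
  int l = 0;

  if (c < 0xd800 || (c >= 0xe000 && c <= 0xffff))
    {
      /* Every non-surrogate BMP character is a single code unit. */
      s[0] = (uint16_t)(c & 0xFFFF);
      l = 1;
    }
  else if (c >= 0x10000 && c <= 0x10ffff)
    {
      /* Only supplementary characters become surrogate pairs. */
      c -= 0x10000;
      s[0] = (uint16_t)((c >> 10) + 0xd800);
      s[1] = (uint16_t)((c & 0x3ff) + 0xdc00);
      l = 2;
    }
  return l;
}
```

  Under these assumptions, the old conditions turn 0xFF8E into the two
units 0xD7FF 0xDF8E, which matches the od output above, while the
patched conditions produce the single unit 0xFF8E.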


Regards,

Koichi