Hello, I would like to submit the following bashbug report.

Configuration Information [Automatically generated, do not change]:
Machine: i686
OS: cygwin
Compiler: gcc
Compilation CFLAGS: -DPROGRAM='bash.exe' -DCONF_HOSTTYPE='i686' -DCONF_OSTYPE='cygwin' -DCONF_MACHTYPE='i686-pc-cygwin' -DCONF_VENDOR='pc' -DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash' -DSHELL -DHAVE_CONFIG_H -DRECYCLES_PIDS -I. -I/usr/src/bash-4.3.46-7.i686/src/bash-4.3 -I/usr/src/bash-4.3.46-7.i686/src/bash-4.3/include -I/usr/src/bash-4.3.46-7.i686/src/bash-4.3/lib -DWORDEXP_OPTION -ggdb -O2 -pipe -Wimplicit-function-declaration -fdebug-prefix-map=/usr/src/bash-4.3.46-7.i686/build=/usr/src/debug/bash-4.3.46-7 -fdebug-prefix-map=/usr/src/bash-4.3.46-7.i686/src/bash-4.3=/usr/src/debug/bash-4.3.46-7
uname output: CYGWIN_NT-10.0-WOW magnate2016 2.6.0(0.304/5/3) 2016-08-31 14:27 i686 Cygwin
Machine Type: i686-pc-cygwin

Bash Version: 4.3
Patch Level: 46
Release Status: release

Description:
        I noticed that built-in commands such as "printf '\uFF8E'"
        generate broken surrogate pairs in Cygwin.

Repeat-By:
        $ echo $MACHTYPE
        i686-pc-cygwin
        $ echo $LANG
        ja_JP.UTF-8
        $ printf '\uFF8E\n'       # <-- U+FF8E is "halfwidth kana Ho", a Japanese character.
        ??                        # <-- Some unknown characters are output.
        $ /bin/printf '\uFF8E\n'
        ホ                        # <-- OK with /bin/printf
        $ printf '\uFF8E' | od -t x1 -A n
         ed 9f bf ed be 8e        # <-- This is the UTF-8 representation of <U+D7FF U+DF8E>.

        Note that <U+D7FF U+DF8E> is a broken surrogate pair. The first
        element of a surrogate pair should be in the range from U+D800
        to U+DBFF, and the second should be in the range from U+DC00 to
        U+DFFF. In any case, the character U+FF8E cannot be represented
        by a surrogate pair.

Fix:
        I think the function "u32toutf16 (c, s)" in lib/sh/unicode.c is
        broken. Note that this function is only used on systems where
        "sizeof (wchar_t) == 2". Cygwin is one of them.

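To illustrate how the bytes in the od output can arise, here is a minimal C sketch of the surrogate arithmetic applied to a code point below U+10000. It is not the bash source: the helper name surrogate_arith and the fixed-width types are hypothetical, and the low-surrogate line is the usual UTF-16 formula, assumed here because the diff context ends before that line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of bash: applies the surrogate-pair
   arithmetic unconditionally, which is what a range check starting at
   0xe000 permits for BMP code points such as U+FF8E. */
static void
surrogate_arith (uint32_t c, uint16_t s[2])
{
  assert (s != NULL);
  c -= 0x10000;                              /* wraps around when c < 0x10000 */
  s[0] = (uint16_t) ((c >> 10) + 0xd800);    /* high unit, truncated to 16 bits */
  s[1] = (uint16_t) ((c & 0x3ff) + 0xdc00);  /* low unit (assumed standard formula) */
}
```

Calling surrogate_arith (0xFF8E, s) leaves s[0] == 0xD7FF and s[1] == 0xDF8E, which matches the pair seen in the od output. For a supplementary code point such as U+1F600 the same arithmetic yields the valid pair <U+D83D U+DE00>.
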
        Also, I checked the source code of the latest version, bash-4.4
        (patch level 0), and the function is not yet fixed there.
        Characters in the range from U+E000 to U+FFFF should not be
        encoded as surrogate pairs; they don't have surrogate-pair
        representations.

diff --git a/lib/sh/unicode.c b/lib/sh/unicode.c
index b58eaef..29acac6 100644
--- a/lib/sh/unicode.c
+++ b/lib/sh/unicode.c
@@ -219,12 +219,12 @@ u32toutf16 (c, s)
   int l;
 
   l = 0;
-  if (c < 0x0d800)
+  if (c < 0x0d800 || (c >= 0x0e000 && c <= 0x0ffff))
     {
       s[0] = (unsigned short) (c & 0xFFFF);
       l = 1;
     }
-  else if (c >= 0x0e000 && c <= 0x010ffff)
+  else if (c >= 0x10000 && c <= 0x010ffff)
     {
       c -= 0x010000;
       s[0] = (unsigned short)((c >> 10) + 0xd800);

Regards,
Koichi
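
P.S. For anyone who wants to check the ranges outside of bash, the patched range checks can be sketched as a standalone function. This is only an illustration under assumptions: the name u32_to_utf16_sketch, the fixed-width types, the return convention (number of code units written, 0 if not encodable), and the low-surrogate line are mine and are not taken from lib/sh/unicode.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch, not the bash implementation.  Returns the number
   of UTF-16 code units written to s, or 0 if c is not encodable. */
static int
u32_to_utf16_sketch (uint32_t c, uint16_t s[2])
{
  assert (s != NULL);
  /* BMP code points outside the surrogate block: one code unit. */
  if (c < 0xd800 || (c >= 0xe000 && c <= 0xffff))
    {
      s[0] = (uint16_t) c;
      return 1;
    }
  /* Supplementary code points: a high/low surrogate pair. */
  if (c >= 0x10000 && c <= 0x10ffff)
    {
      c -= 0x10000;
      s[0] = (uint16_t) ((c >> 10) + 0xd800);
      s[1] = (uint16_t) ((c & 0x3ff) + 0xdc00);  /* assumed standard formula */
      return 2;
    }
  /* Surrogate code points and values above U+10FFFF. */
  return 0;
}
```

With this sketch, U+FF8E is stored as the single code unit 0xFF8E, and U+1F600 is stored as the pair <U+D83D U+DE00>.
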