Configuration Information [Automatically generated, do not change]: Machine: x86_64 OS: cygwin Compiler: gcc Compilation CFLAGS: -DPROGRAM='bash.exe' -DCONF_HOSTTYPE='x86_64' -DCONF_OSTYPE='cygwin' -DCONF_MACHTYPE='x86_64-unknown-cygwin' -DCONF_VENDOR='unknown' -DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash' -DSHELL -DHAVE_CONFIG_H -DRECYCLES_PIDS -I. -I/usr/src/bash-4.4.12-3.x86_64/src/bash-4.4 -I/usr/src/bash-4.4.12-3.x86_64/src/bash-4.4/include -I/usr/src/bash-4.4.12-3.x86_64/src/bash-4.4/lib -DWORDEXP_OPTION -ggdb -O2 -pipe -Wimplicit-function-declaration -fdebug-prefix-map=/usr/src/bash-4.4.12-3.x86_64/build=/usr/src/debug/bash-4.4.12-3 -fdebug-prefix-map=/usr/src/bash-4.4.12-3.x86_64/src/bash-4.4=/usr/src/debug/bash-4.4.12-3 -Wno-parentheses -Wno-format-security uname output: CYGWIN_NT-10.0 letsnote2019 3.3.3(0.341/5/3) 2021-12-03 16:35 x86_64 Cygwin Machine Type: x86_64-unknown-cygwin
Bash Version: 4.4 Patch Level: 12 Release Status: release Description: With a multi-byte encoding that has a non-trivial intermediate state (mbstate_t), « printf %d "'<char>" » can be affected by the internal mbstate_t of `mbtowc'/`mblen' to produce a wrong result. Also, it can leave the internal mbstate_t in an intermediate state. This is because `mbtowc', which uses the internal mbstate_t, is used by the printf builtin to get the character code of <char>. Instead, `mbrtowc' that receives `mbstate_t *' as an argument can be used with a properly initialized mbstate_t. In the codebase, there are several other similar codes relying on an undefined state of the internal mbstate_t. Repeat-By: I faced a problem when I tried to get character codes of U+1XXXX [i.e., Unicode characters outside Basic Multilingual Plane (BMP) whose code points are larger than U+FFFF] in a UTF-8 locale in Cygwin and MSYS2, in which sizeof(wchar_t) == 2. For example, consider the following command: $ printf '<%x>' $'"\U1F6D1'{1..4};echo We expect four identical hex numbers as the result because the character after the double quote is always <U+1F6D1> for all the four arguments. However, the actual result becomes [bash-4.4/cygwin]$ printf '<%x>' $'"\U1F6D1'{1..4};echo <d83d><0><d83d><0> [bash-5.1/msys2]$ printf '<%x>' $'"\U1F6D1'{1..4};echo <d83d><f0><d83d><f0> The above behaviors are caused in the following way: In systems where sizeof (wchar_t) == 2, such as Cygwin and MSYS2, the character codes of U+1XXXX do not fit in one wchar_t, so `mbtowc'/`mbrtowc` wants to produce a surrogate pair consisting of two wchar_t. 1. For the first call of `mbtowc', a high surrogate (U+D800..DBFF) is generated, and the remaining information needed to produce a low surrogate (U+DC00..DFFF) is stored in mbstate_t. 2. The printf builtin tries to extract the code of the second argument using `mbtowc' without clearing the internal mbstate_t. This causes a decode error and results in <0> (or a fallback interpretation of a remaining byte, <f0>, in bash 5.0+). I expect the result <d83d><d83d><d83d><d83d> in Cygwin/MSYS2. Fix: I attach a patch, r0036-avoid-internal-mbstate.patch.txt, which includes the following changes: * asciicode (builtins/printf.def): Even though `mbstate_t state' was declared by `DECLARE_MBSTATE', it was not used in the original code. We can use `mbrtowc' instead of `mbtowc', where we can pass `state' to the fourth argument of `mbrtowc'. There are also other similar cases relying on an uncontrolled intermediate internal mbstate_t and affecting the internal mbstate_t: * mbscasecmp (lib/sh/mbscasecmp.c), mbscmp (lib/sh/mbscmp.c): In these functions, two intermediate states for two independent strings are mixed. We can declare two distinct mbstate_t instances and initialize and use them. * indirection_level_string (print_cmd.c), setifs (subst.c): These functions also depended on an undefined internal mbstate_t. We can declare and initialize mbstate_t `state' by `DECLARE_MBSTATE' and use it. * string_extract_verbatim (subst.c): `MBRLEN' and `mbrtowc' should be called using the current mbstate_t stored in `state'. Not to affect `state' used by `ADVANCE_CHAR', we can first copy the value to another mbstate_t, `mbstmp', and pass it to `MBRLEN' and `mbrtowc'. The patch includes cleanup of a macro that becomes unused: * include/shmbutil.h: I have removed the macro `MBLEN' in the patch because it is not used in the codebase anymore and also because it has the general problem and seems to be unlikely used in the future. -- Koichi
From 2fb725034eabf2035f1abf815345b2941922ca01 Mon Sep 17 00:00:00 2001 From: Koichi Murase <myoga.mur...@gmail.com> Date: Sat, 24 Sep 2022 12:55:17 +0900 Subject: [PATCH 1/2] shmbutil: avoid intermediate mbstate --- builtins/printf.def | 2 +- include/shmbutil.h | 2 -- lib/sh/mbscasecmp.c | 7 +++---- lib/sh/mbscmp.c | 7 +++---- print_cmd.c | 3 ++- subst.c | 18 ++++++++++++------ 6 files changed, 21 insertions(+), 18 deletions(-) diff --git a/builtins/printf.def b/builtins/printf.def index 84658c39..abc41b76 100644 --- a/builtins/printf.def +++ b/builtins/printf.def @@ -1343,7 +1343,7 @@ asciicode () #if defined (HANDLE_MULTIBYTE) slen = strlen (garglist->word->word+1); wc = 0; - mblength = mbtowc (&wc, garglist->word->word+1, slen); + mblength = mbrtowc (&wc, garglist->word->word+1, slen, &state); if (mblength > 0) ch = wc; /* XXX */ else diff --git a/include/shmbutil.h b/include/shmbutil.h index 0f711eab..1b1db37e 100644 --- a/include/shmbutil.h +++ b/include/shmbutil.h @@ -48,7 +48,6 @@ extern int locale_utf8locale; /* XXX */ #define MBSLEN(s) (((s) && (s)[0]) ? ((s)[1] ? mbstrlen (s) : 1) : 0) #define MB_STRLEN(s) ((MB_CUR_MAX > 1) ? MBSLEN (s) : STRLEN (s)) -#define MBLEN(s, n) ((MB_CUR_MAX > 1) ? mblen ((s), (n)) : 1) #define MBRLEN(s, n, p) ((MB_CUR_MAX > 1) ? mbrlen ((s), (n), (p)) : 1) #define UTF8_SINGLEBYTE(c) (((c) & 0x80) == 0) @@ -73,7 +72,6 @@ extern int locale_utf8locale; /* XXX */ #define MB_STRLEN(s) (STRLEN(s)) -#define MBLEN(s, n) 1 #define MBRLEN(s, n, p) 1 #ifndef wchar_t diff --git a/lib/sh/mbscasecmp.c b/lib/sh/mbscasecmp.c index 0ab95605..3c0ac471 100644 --- a/lib/sh/mbscasecmp.c +++ b/lib/sh/mbscasecmp.c @@ -37,16 +37,15 @@ mbscasecmp (mbs1, mbs2) { int len1, len2, mb_cur_max; wchar_t c1, c2, l1, l2; + mbstate_t mbstat1 = { 0 }, mbstat2 = { 0 }; len1 = len2 = 0; - /* Reset multibyte characters to their initial state. */ - (void) mblen ((char *) NULL, 0); mb_cur_max = MB_CUR_MAX; do { - len1 = mbtowc (&c1, mbs1, mb_cur_max); - len2 = mbtowc (&c2, mbs2, mb_cur_max); + len1 = mbrtowc (&c1, mbs1, mb_cur_max, &mbstat1); + len2 = mbrtowc (&c2, mbs2, mb_cur_max, &mbstat2); if (len1 == 0) return len2 == 0 ? 0 : -1; diff --git a/lib/sh/mbscmp.c b/lib/sh/mbscmp.c index c7c84435..aa00c719 100644 --- a/lib/sh/mbscmp.c +++ b/lib/sh/mbscmp.c @@ -38,16 +38,15 @@ mbscmp (mbs1, mbs2) { int len1, len2, mb_cur_max; wchar_t c1, c2; + mbstate_t mbstat1 = { 0 }, mbstat2 = { 0 }; len1 = len2 = 0; - /* Reset multibyte characters to their initial state. */ - (void) mblen ((char *) NULL, 0); mb_cur_max = MB_CUR_MAX; do { - len1 = mbtowc (&c1, mbs1, mb_cur_max); - len2 = mbtowc (&c2, mbs2, mb_cur_max); + len1 = mbrtowc (&c1, mbs1, mb_cur_max, &mbstat1); + len2 = mbrtowc (&c2, mbs2, mb_cur_max, &mbstat2); if (len1 == 0) return len2 == 0 ? 0 : -1; diff --git a/print_cmd.c b/print_cmd.c index eef9bb6a..406417be 100644 --- a/print_cmd.c +++ b/print_cmd.c @@ -451,6 +451,7 @@ indirection_level_string () char *ps4; char ps4_firstc[MB_LEN_MAX+1]; int ps4_firstc_len, ps4_len, ineed, old; + DECLARE_MBSTATE; ps4 = get_string_value ("PS4"); if (indirection_string == 0) @@ -473,7 +474,7 @@ indirection_level_string () #if defined (HANDLE_MULTIBYTE) ps4_len = strnlen (ps4, MB_CUR_MAX); - ps4_firstc_len = MBLEN (ps4, ps4_len); + ps4_firstc_len = MBRLEN (ps4, ps4_len, &state); if (ps4_firstc_len == 1 || ps4_firstc_len == 0 || ps4_firstc_len < 0) { ps4_firstc[0] = ps4[0]; diff --git a/subst.c b/subst.c index d9feabca..d0c9b04d 100644 --- a/subst.c +++ b/subst.c @@ -1171,12 +1171,13 @@ string_extract_verbatim (string, slen, sindex, charlist, flags) int flags; { register int i; -#if defined (HANDLE_MULTIBYTE) - wchar_t *wcharlist; -#endif int c; char *temp; +#if defined (HANDLE_MULTIBYTE) + wchar_t *wcharlist; + mbstate_t mbstmp; DECLARE_MBSTATE; +#endif if ((flags & SX_NOCTLESC) && charlist[0] == '\'' && charlist[1] == '\0') { @@ -1226,11 +1227,15 @@ string_extract_verbatim (string, slen, sindex, charlist, flags) if (locale_utf8locale && slen > i && UTF8_SINGLEBYTE (string[i])) mblength = (string[i] != 0) ? 1 : 0; else - mblength = MBLEN (string + i, slen - i); + { + mbstmp = state; + mblength = MBRLEN (string + i, slen - i, &mbstmp); + } if (mblength > 1) { wchar_t wc; - mblength = mbtowc (&wc, string + i, slen - i); + mbstmp = state; + mblength = mbrtowc (&wc, string + i, slen - i, &mbstmp); if (MB_INVALIDCH (mblength)) { if (MEMBER (c, charlist)) @@ -12020,8 +12025,9 @@ setifs (v) else { size_t ifs_len; + DECLARE_MBSTATE; ifs_len = strnlen (ifs_value, MB_CUR_MAX); - ifs_firstc_len = MBLEN (ifs_value, ifs_len); + ifs_firstc_len = MBRLEN (ifs_value, ifs_len, &state); } if (ifs_firstc_len == 1 || ifs_firstc_len == 0 || MB_INVALIDCH (ifs_firstc_len)) { -- 2.37.2