Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: cygwin
Compiler: gcc
Compilation CFLAGS: -DPROGRAM='bash.exe' -DCONF_HOSTTYPE='x86_64'
-DCONF_OSTYPE='cygwin' -DCONF_MACHTYPE='x86_64-unknown-cygwin'
-DCONF_VENDOR='unknown' -DLOCALEDIR='/usr/share/locale'
-DPACKAGE='bash' -DSHELL -DHAVE_CONFIG_H -DRECYCLES_PIDS -I.
-I/usr/src/bash-4.4.12-3.x86_64/src/bash-4.4
-I/usr/src/bash-4.4.12-3.x86_64/src/bash-4.4/include
-I/usr/src/bash-4.4.12-3.x86_64/src/bash-4.4/lib -DWORDEXP_OPTION
-ggdb -O2 -pipe -Wimplicit-function-declaration
-fdebug-prefix-map=/usr/src/bash-4.4.12-3.x86_64/build=/usr/src/debug/bash-4.4.12-3
-fdebug-prefix-map=/usr/src/bash-4.4.12-3.x86_64/src/bash-4.4=/usr/src/debug/bash-4.4.12-3
-Wno-parentheses -Wno-format-security
uname output: CYGWIN_NT-10.0 letsnote2019 3.3.3(0.341/5/3) 2021-12-03
16:35 x86_64 Cygwin
Machine Type: x86_64-unknown-cygwin

Bash Version: 4.4
Patch Level: 12
Release Status: release

Description:

  With a multi-byte encoding that has a non-trivial intermediate state
  (mbstate_t), « printf %d "'<char>" » can be affected by the internal
  mbstate_t of `mbtowc'/`mblen' to produce a wrong result.  Also, it
  can leave the internal mbstate_t in an intermediate state.

  This is because `mbtowc', which uses the internal mbstate_t, is used
  by the printf builtin to get the character code of <char>.  Instead,
  `mbrtowc' that receives `mbstate_t *' as an argument can be used
  with a properly initialized mbstate_t.  In the codebase, there are
  several other similar codes relying on an undefined state of the
  internal mbstate_t.

Repeat-By:

  I faced a problem when I tried to get character codes of U+1XXXX
  [i.e., Unicode characters outside Basic Multilingual Plane (BMP)
  whose code points are larger than U+FFFF] in a UTF-8 locale in
  Cygwin and MSYS2, in which sizeof(wchar_t) == 2.  For example,
  consider the following command:

  $ printf '<%x>' $'"\U1F6D1'{1..4};echo

  We expect four identical hex numbers as the result because the
  character after the double quote is always <U+1F6D1> for all the
  four arguments.  However, the actual result becomes

  [bash-4.4/cygwin]$ printf '<%x>' $'"\U1F6D1'{1..4};echo
  <d83d><0><d83d><0>

  [bash-5.1/msys2]$ printf '<%x>' $'"\U1F6D1'{1..4};echo
  <d83d><f0><d83d><f0>

  The above behaviors are caused in the following way: In systems
  where sizeof (wchar_t) == 2, such as Cygwin and MSYS2, the character
  codes of U+1XXXX do not fit in one wchar_t, so `mbtowc'/`mbrtowc`
  wants to produce a surrogate pair consisting of two wchar_t.

  1. For the first call of `mbtowc', a high surrogate (U+D800..DBFF)
    is generated, and the remaining information needed to produce a
    low surrogate (U+DC00..DFFF) is stored in mbstate_t.

  2. The printf builtin tries to extract the code of the second
    argument using `mbtowc' without clearing the internal mbstate_t.
    This causes a decode error and results in <0> (or a fallback
    interpretation of a remaining byte, <f0>, in bash 5.0+).

  I expect the result <d83d><d83d><d83d><d83d> in Cygwin/MSYS2.

Fix:

  I attach a patch, r0036-avoid-internal-mbstate.patch.txt, which
  includes the following changes:

  * asciicode (builtins/printf.def): Even though `mbstate_t state' was
    declared by `DECLARE_MBSTATE', it was not used in the original
    code.  We can use `mbrtowc' instead of `mbtowc', where we can pass
    `state' to the fourth argument of `mbrtowc'.

  There are also other similar cases relying on an uncontrolled
  intermediate internal mbstate_t and affecting the internal
  mbstate_t:

  * mbscasecmp (lib/sh/mbscasecmp.c), mbscmp (lib/sh/mbscmp.c): In
    these functions, two intermediate states for two independent
    strings are mixed.  We can declare two distinct mbstate_t
    instances and initialize and use them.

  * indirection_level_string (print_cmd.c), setifs (subst.c): These
    functions also depended on an undefined internal mbstate_t.  We
    can declare and initialize mbstate_t `state' by `DECLARE_MBSTATE'
    and use it.

  * string_extract_verbatim (subst.c): `MBRLEN' and `mbrtowc' should
    be called using the current mbstate_t stored in `state'.  Not to
    affect `state' used by `ADVANCE_CHAR', we can first copy the value
    to another mbstate_t, `mbstmp', and pass it to `MBRLEN' and
    `mbrtowc'.

  The patch includes cleanup of a macro that becomes unused:

  * include/shmbutil.h: I have removed the macro `MBLEN' in the patch
    because it is not used in the codebase anymore and also because it
    has the general problem and seems to be unlikely used in the
    future.

--
Koichi
From 2fb725034eabf2035f1abf815345b2941922ca01 Mon Sep 17 00:00:00 2001
From: Koichi Murase <myoga.mur...@gmail.com>
Date: Sat, 24 Sep 2022 12:55:17 +0900
Subject: [PATCH 1/2] shmbutil: avoid intermediate mbstate

---
 builtins/printf.def |  2 +-
 include/shmbutil.h  |  2 --
 lib/sh/mbscasecmp.c |  7 +++----
 lib/sh/mbscmp.c     |  7 +++----
 print_cmd.c         |  3 ++-
 subst.c             | 18 ++++++++++++------
 6 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/builtins/printf.def b/builtins/printf.def
index 84658c39..abc41b76 100644
--- a/builtins/printf.def
+++ b/builtins/printf.def
@@ -1343,7 +1343,7 @@ asciicode ()
 #if defined (HANDLE_MULTIBYTE)
   slen = strlen (garglist->word->word+1);
   wc = 0;
-  mblength = mbtowc (&wc, garglist->word->word+1, slen);
+  mblength = mbrtowc (&wc, garglist->word->word+1, slen, &state);
   if (mblength > 0)
     ch = wc;           /* XXX */
   else
diff --git a/include/shmbutil.h b/include/shmbutil.h
index 0f711eab..1b1db37e 100644
--- a/include/shmbutil.h
+++ b/include/shmbutil.h
@@ -48,7 +48,6 @@ extern int locale_utf8locale; /* XXX */
 #define MBSLEN(s)      (((s) && (s)[0]) ? ((s)[1] ? mbstrlen (s) : 1) : 0)
 #define MB_STRLEN(s)   ((MB_CUR_MAX > 1) ? MBSLEN (s) : STRLEN (s))
 
-#define MBLEN(s, n)    ((MB_CUR_MAX > 1) ? mblen ((s), (n)) : 1)
 #define MBRLEN(s, n, p)        ((MB_CUR_MAX > 1) ? mbrlen ((s), (n), (p)) : 1)
 
 #define UTF8_SINGLEBYTE(c)     (((c) & 0x80) == 0)
@@ -73,7 +72,6 @@ extern int locale_utf8locale; /* XXX */
 
 #define MB_STRLEN(s)           (STRLEN(s))
 
-#define MBLEN(s, n)            1
 #define MBRLEN(s, n, p)                1
 
 #ifndef wchar_t
diff --git a/lib/sh/mbscasecmp.c b/lib/sh/mbscasecmp.c
index 0ab95605..3c0ac471 100644
--- a/lib/sh/mbscasecmp.c
+++ b/lib/sh/mbscasecmp.c
@@ -37,16 +37,15 @@ mbscasecmp (mbs1, mbs2)
 {
   int len1, len2, mb_cur_max;
   wchar_t c1, c2, l1, l2;
+  mbstate_t mbstat1 = { 0 }, mbstat2 = { 0 };
 
   len1 = len2 = 0;
-  /* Reset multibyte characters to their initial state.         */
-  (void) mblen ((char *) NULL, 0);
 
   mb_cur_max = MB_CUR_MAX;
   do
     {
-      len1 = mbtowc (&c1, mbs1, mb_cur_max);
-      len2 = mbtowc (&c2, mbs2, mb_cur_max);
+      len1 = mbrtowc (&c1, mbs1, mb_cur_max, &mbstat1);
+      len2 = mbrtowc (&c2, mbs2, mb_cur_max, &mbstat2);
 
       if (len1 == 0)
        return len2 == 0 ? 0 : -1;
diff --git a/lib/sh/mbscmp.c b/lib/sh/mbscmp.c
index c7c84435..aa00c719 100644
--- a/lib/sh/mbscmp.c
+++ b/lib/sh/mbscmp.c
@@ -38,16 +38,15 @@ mbscmp (mbs1, mbs2)
 {
   int len1, len2, mb_cur_max;
   wchar_t c1, c2;
+  mbstate_t mbstat1 = { 0 }, mbstat2 = { 0 };
 
   len1 = len2 = 0;
-  /* Reset multibyte characters to their initial state.         */
-  (void) mblen ((char *) NULL, 0);
 
   mb_cur_max = MB_CUR_MAX;
   do
     {
-      len1 = mbtowc (&c1, mbs1, mb_cur_max);
-      len2 = mbtowc (&c2, mbs2, mb_cur_max);
+      len1 = mbrtowc (&c1, mbs1, mb_cur_max, &mbstat1);
+      len2 = mbrtowc (&c2, mbs2, mb_cur_max, &mbstat2);
 
       if (len1 == 0)
        return len2 == 0 ? 0 : -1;
diff --git a/print_cmd.c b/print_cmd.c
index eef9bb6a..406417be 100644
--- a/print_cmd.c
+++ b/print_cmd.c
@@ -451,6 +451,7 @@ indirection_level_string ()
   char *ps4;
   char ps4_firstc[MB_LEN_MAX+1];
   int ps4_firstc_len, ps4_len, ineed, old;
+  DECLARE_MBSTATE;
 
   ps4 = get_string_value ("PS4");
   if (indirection_string == 0)
@@ -473,7 +474,7 @@ indirection_level_string ()
 
 #if defined (HANDLE_MULTIBYTE)
   ps4_len = strnlen (ps4, MB_CUR_MAX);
-  ps4_firstc_len = MBLEN (ps4, ps4_len);
+  ps4_firstc_len = MBRLEN (ps4, ps4_len, &state);
   if (ps4_firstc_len == 1 || ps4_firstc_len == 0 || ps4_firstc_len < 0)
     {
       ps4_firstc[0] = ps4[0];
diff --git a/subst.c b/subst.c
index d9feabca..d0c9b04d 100644
--- a/subst.c
+++ b/subst.c
@@ -1171,12 +1171,13 @@ string_extract_verbatim (string, slen, sindex, 
charlist, flags)
      int flags;
 {
   register int i;
-#if defined (HANDLE_MULTIBYTE)
-  wchar_t *wcharlist;
-#endif
   int c;
   char *temp;
+#if defined (HANDLE_MULTIBYTE)
+  wchar_t *wcharlist;
+  mbstate_t mbstmp;
   DECLARE_MBSTATE;
+#endif
 
   if ((flags & SX_NOCTLESC) && charlist[0] == '\'' && charlist[1] == '\0')
     {
@@ -1226,11 +1227,15 @@ string_extract_verbatim (string, slen, sindex, 
charlist, flags)
       if (locale_utf8locale && slen > i && UTF8_SINGLEBYTE (string[i]))
        mblength = (string[i] != 0) ? 1 : 0;
       else
-       mblength = MBLEN (string + i, slen - i);
+       {
+         mbstmp = state;
+         mblength = MBRLEN (string + i, slen - i, &mbstmp);
+       }
       if (mblength > 1)
        {
          wchar_t wc;
-         mblength = mbtowc (&wc, string + i, slen - i);
+         mbstmp = state;
+         mblength = mbrtowc (&wc, string + i, slen - i, &mbstmp);
          if (MB_INVALIDCH (mblength))
            {
              if (MEMBER (c, charlist))
@@ -12020,8 +12025,9 @@ setifs (v)
       else
        {
          size_t ifs_len;
+         DECLARE_MBSTATE;
          ifs_len = strnlen (ifs_value, MB_CUR_MAX);
-         ifs_firstc_len = MBLEN (ifs_value, ifs_len);
+         ifs_firstc_len = MBRLEN (ifs_value, ifs_len, &state);
        }
       if (ifs_firstc_len == 1 || ifs_firstc_len == 0 || MB_INVALIDCH 
(ifs_firstc_len))
        {
-- 
2.37.2

Reply via email to