Corrupted multibyte characters in command substitutions

Frank Heckenbach Sat, 01 Jan 2022 18:21:10 -0800

Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS: -g -O2 -fstack-protector-strong -Wformat 
-Werror=format-security -Wall 
uname output: Linux mars 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) 
x86_64 GNU/Linux
Machine Type: x86_64-pc-linux-gnu


Bash Version: 5.1
Patch Level: 4
Release Status: release

Description:

Bash sometimes corrupts multibyte characters in command
substitutions.

I found the bug with the bash version as shipped with Debian
bullseye, but it can be reproduced with an unmodified bash as well.
The attached script shows how to build it (the configure options
given there seem necessary to trigger the bug, except "-g" which I
needed for debugging).

The bug is very fragile and depends heavily on things like the
length of filler characters in the script, of environment variables,
even unrelated ones, and even of the current working directory name.

Therefore the attached script is a wrapper that tries to reproduce
the conditions for the bug to occur. I'm not sure if anything else
from my environment is relevant; if it doesn't reproduce the bug for
you, you can try playing with things like environment variables,
filler lines in the script etc.

The wrapper then calls the actual buggy script (a trimmed-down
version of my actual script exhibiting the bug) which is the lower
here-document in the script.

It's meant to read input from stdin (here, 511 spaces and a 2-byte
UTF-8 character, so it crosses a 512-byte boundary), and output it
unchanged (with a trailing newline which is irrelevant here), so the
expected output is:

20 ... 20 c3 a4 0a

But when the bug occurs, it gives:

20 ... 20 c3 90 a4 0a

(The wrongly inserted byte may be something else instead of "90".)
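
To make the input layout concrete, here is a minimal C sketch (not taken
from the attached script) that builds the same 513 bytes and checks that
the 2-byte character straddles a block boundary. The names make_input and
crosses_block are made up for illustration, and a 512-byte block size is
assumed from the description above.

```c
#include <stddef.h>
#include <string.h>

enum { FILLER = 511, BLOCK = 512 };   /* assumed sizes, see text above */

/* Fill out (at least FILLER + 2 bytes) with 511 spaces followed by the
   UTF-8 bytes c3 a4; return the number of bytes written. */
size_t make_input(unsigned char *out)
{
    memset(out, 0x20, FILLER);
    out[FILLER] = 0xC3;
    out[FILLER + 1] = 0xA4;
    return FILLER + 2;
}

/* Nonzero if the byte range [start, start + len) spans two BLOCK-sized
   blocks; len must be at least 1. */
int crosses_block(size_t start, size_t len)
{
    return start / BLOCK != (start + len - 1) / BLOCK;
}
```

With these definitions, crosses_block(511, 2) is true: c3 is the last byte
of the first block and a4 the first byte of the second. With one filler
byte less, crosses_block(510, 2) is false.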

I traced the bug to subst.c:6244:

          mblen = mbrtowc (&wc, bufp-1, bufn+1, &ps);

Here, bufn+1 is too big by 1, so the function will read past the end
of the input data, and thus here past the end of the buf array, which
is undefined behavior. (That's why the bug is so fragile: something
else has to dirty exactly the memory location that is wrongly read
here.)
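
The effect of a count that is too big by 1 can be sketched without bash.
The code below is only an illustration: toy_mblen is a hand-written
stand-in for mbrtowc() (so it does not depend on the locale), it handles
only 1- and 2-byte UTF-8 sequences, and the buffer contents are invented.
The real mbrtowc() returns (size_t)-2 for an incomplete sequence; the toy
returns a plain -2.

```c
#include <stddef.h>

/* Toy stand-in for mbrtowc(), NOT the libc function: looks at up to n
   bytes at s and returns the sequence length (1 or 2), -2 if the
   sequence is cut off after n bytes, or -1 if it is invalid. */
static int toy_mblen(const unsigned char *s, size_t n)
{
    if (n == 0)
        return -2;
    if (s[0] < 0x80)
        return 1;                       /* ASCII */
    if ((s[0] & 0xE0) == 0xC0) {        /* lead byte of a 2-byte sequence */
        if (n < 2)
            return -2;                  /* lead byte only: incomplete */
        return (s[1] & 0xC0) == 0x80 ? 2 : -1;
    }
    return -1;
}

/* Models the tail of the read buffer: only tail[0] (c3) came from the
   last read; tail[1] (90) is a stale leftover. The array has 2 elements
   so that this demo itself stays in bounds. */
static const unsigned char tail[2] = { 0xC3, 0x90 };

/* Count matches the valid data: 1 byte. */
int mblen_with_correct_count(void)
{
    return toy_mblen(tail, 1);
}

/* Count is too big by 1: the stale byte is examined as well. */
int mblen_with_count_plus_one(void)
{
    return toy_mblen(tail, 2);
}
```

With the correct count the decoder reports an incomplete character (-2)
and would wait for more input. With the count too big by 1 it accepts
c3 90 as a complete character, which would be consistent with the
c3 90 a4 output shown above.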

However, I'd say the actual cause of the bug is rather the handling
of bufn in the read loop. After a char is consumed from the buffer
(6207), bufn is not decremented until the next loop iteration, 6199:

      if (--bufn <= 0)

This means (a) bufn is decremented one time too many at the start (which
is compensated for by using "<=" where otherwise "==" would do), and
(b) bufn is too big by 1 for the rest of the loop.

So far, the only place where it matters is the mblen call above, so
the bug could be avoided by subtracting 1 there, but I think it's
more robust to decrement bufn when consuming the char to avoid this
pitfall for future changes, so that's what my patch does.
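
The two bookkeeping styles can be compared in a small stand-alone sketch.
This is not the attached patch and not bash's code: toy_read stands in
for the real read call, the 4-byte CHUNK is arbitrary, and the two
slack_* functions are hypothetical helpers. They return the largest value
of bufn minus the number of bytes actually left unconsumed in the buffer,
measured right after a byte has been consumed.

```c
#include <stddef.h>
#include <string.h>

enum { CHUNK = 4 };                     /* arbitrary toy buffer size */

struct source { const unsigned char *data; size_t len, pos; };

/* Stand-in for the read call: copy up to cap bytes, return the count
   (0 at end of input). */
static int toy_read(struct source *src, unsigned char *buf, size_t cap)
{
    size_t n = src->len - src->pos;
    if (n > cap)
        n = cap;
    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    return (int)n;
}

/* Late decrement, in the style of the quoted "if (--bufn <= 0)". */
int slack_late_decrement(const unsigned char *data, size_t len)
{
    struct source src = { data, len, 0 };
    unsigned char buf[CHUNK], *bufp = buf;
    int bufn = 0, filled = 0, worst = 0;

    for (;;) {
        if (--bufn <= 0) {
            bufn = filled = toy_read(&src, buf, sizeof buf);
            if (bufn <= 0)
                break;
            bufp = buf;
        }
        unsigned char c = *bufp++;      /* consume one byte */
        (void)c;
        int unconsumed = (int)(buf + filled - bufp);
        if (bufn - unconsumed > worst)
            worst = bufn - unconsumed;
    }
    return worst;
}

/* Early decrement: bufn is updated at the point where the byte is
   consumed, so a plain "<= 0" test at the top is enough. */
int slack_early_decrement(const unsigned char *data, size_t len)
{
    struct source src = { data, len, 0 };
    unsigned char buf[CHUNK], *bufp = buf;
    int bufn = 0, filled = 0, worst = 0;

    for (;;) {
        if (bufn <= 0) {
            bufn = filled = toy_read(&src, buf, sizeof buf);
            if (bufn <= 0)
                break;
            bufp = buf;
        }
        unsigned char c = *bufp++;      /* consume one byte ... */
        (void)c;
        bufn--;                         /* ... and account for it at once */
        int unconsumed = (int)(buf + filled - bufp);
        if (bufn - unconsumed > worst)
            worst = bufn - unconsumed;
    }
    return worst;
}
```

For any non-empty input the late-decrement loop reports a slack of 1 and
the early-decrement loop a slack of 0, i.e. in the second style bufn
always equals the number of bytes still waiting in the buffer.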

Repeat-By:

Running the attached wrapper script.

Fix:

I've included my patch in the wrapper script, activated by setting
"patched=y", so it can easily be tested in the same environment; you
can just extract it from there.

bash-utf8-bug
Description: Binary data
