Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS: -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wall
uname output: Linux mars 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux
Machine Type: x86_64-pc-linux-gnu

Bash Version: 5.1
Patch Level: 4
Release Status: release

Description:

Bash sometimes corrupts multibyte characters in command substitutions.

I found the bug with the bash version as shipped with Debian bullseye, but it can be reproduced with an unmodified bash as well. The attached script shows how to build it (the configure options given there seem necessary to trigger the bug, except "-g" which I needed for debugging).

The bug is very fragile and depends heavily on things like the length of filler characters in the script, of environment variables, even unrelated ones, and even of the current working directory name. Therefore the attached script is a wrapper that tries to reproduce the conditions for the bug to occur. I'm not sure if anything else from my environment is relevant; if it doesn't reproduce the bug for you, you can try playing with things like environment variables, filler lines in the script etc.

The wrapper then calls the actual buggy script (a trimmed-down version of my actual script exhibiting the bug), which is the lower here-document in the script. It's meant to read input from stdin (here, 511 spaces and a 2-byte UTF-8 character, so it crosses a 512-byte boundary) and output it unchanged (with a trailing newline, which is irrelevant here), so the expected output is:

	20 ... 20 c3 a4 0a

But when the bug occurs, it gives:

	20 ... 20 c3 90 a4 0a

(The wrongly inserted byte may be something else instead of "90".)

I traced the bug to subst.c:6244:

	mblen = mbrtowc (&wc, bufp-1, bufn+1, &ps);

Here, bufn+1 is too big by 1, so the function can read past the end of the input data, and thus past the end of the buf array, which is undefined behavior. (That's why the bug is so fragile; things like the environment and the filler lines need to dirty exactly the memory location that is wrongly read here.)

However, I'd say the actual cause of the bug is rather the handling of bufn in the read loop.
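
The effect of a length argument that is too big by 1 can be sketched in isolation. This is a stand-alone illustration, not code from bash: the helper names use_utf8_locale and decode_with_len, the locale names tried, and the fixed two-byte array are all assumptions. The first byte, c3, plays the last valid byte in the buffer; the second, 90, stands in for whatever happens to lie behind the valid data.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Try a few common UTF-8 locale names (assumed; adjust for your system).
   Returns 1 if one of them could be set, 0 otherwise. */
int use_utf8_locale(void)
{
    static const char *const names[] = {
        "C.UTF-8", "en_US.UTF-8", "C.utf8", "en_US.utf8"
    };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Decode one character, telling mbrtowc that len bytes are available.
   Only the first byte counts as valid data in this sketch; len == 2
   models the over-long length. */
size_t decode_with_len(size_t len)
{
    static const char bytes[] = "\xc3\x90";
    mbstate_t ps;
    wchar_t wc = 0;

    assert(len >= 1 && len <= 2);   /* never read beyond the demo array */
    memset(&ps, 0, sizeof ps);
    return mbrtowc(&wc, bytes, len, &ps);
}
```

In a UTF-8 locale, decode_with_len(1) should return (size_t)-2 (an incomplete character, so only the c3 would be copied), whereas decode_with_len(2) should return 2, i.e. the stray byte is accepted as part of the character.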
After a char is consumed from the buffer (6207), bufn is not decremented until the next loop iteration, at 6199:

	if (--bufn <= 0)

This means (a) bufn is decremented once too often at the start (which is compensated for by using "<=" where otherwise "==" would do), and (b) bufn is too big by 1 for the rest of the loop.

So far, the only place where this matters is the mbrtowc call above, so the bug could be avoided by subtracting 1 there. But I think it's more robust to decrement bufn when consuming the char, to avoid this pitfall for future changes, so that's what my patch does.

Repeat-By:

Running the attached wrapper script.

Fix:

I've included my patch in the wrapper script, activated by setting "patched=y", so it can easily be tested in the same environment; you can just extract it from there.
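
The counter handling described above can be modelled without any real I/O. The following is only a sketch of the accounting, not the patch from the wrapper script: the function names and the single-chunk simplification are assumptions. Each function returns the length that an expression like bufn+1 would pass while byte k (0-based) of an n-byte chunk is the current character.

```c
#include <assert.h>

/* Accounting as described for the current loop: the decrement for a
   consumed byte is deferred to the start of the next iteration.
   Returns -1 if the chunk ends before byte k. */
int len_arg_original(int n, int k)
{
    int bufn = 0, pos = 0, filled = 0;

    assert(n > 0 && k >= 0);
    for (;;) {
        if (--bufn <= 0) {          /* deferred decrement happens here */
            if (filled)
                return -1;          /* chunk used up */
            bufn = n;               /* stands in for reading into buf */
            pos = 0;
            filled = 1;
        }
        pos++;                      /* c = *bufp++ */
        if (pos - 1 == k)
            return bufn + 1;        /* only bufn bytes start at bufp-1 */
    }
}

/* Alternative accounting: decrement together with consuming the byte,
   so bufn always equals the number of bytes left after bufp. */
int len_arg_decrement_on_consume(int n, int k)
{
    int bufn = 0, pos = 0, filled = 0;

    assert(n > 0 && k >= 0);
    for (;;) {
        if (bufn == 0) {            /* "==" suffices with this scheme */
            if (filled)
                return -1;          /* chunk used up */
            bufn = n;               /* stands in for reading into buf */
            pos = 0;
            filled = 1;
        }
        pos++;                      /* c = *bufp++ */
        bufn--;                     /* decrement right when consuming */
        if (pos - 1 == k)
            return bufn + 1;        /* exactly the bytes from bufp-1 */
    }
}
```

For a 512-byte chunk whose last byte (k = 511) is the lead byte c3, the first function yields 2, one more than the single byte actually available, while the second yields 1.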

Attachment: bash-utf8-bug