Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: cc
Compilation CFLAGS: -fstack-clash-protection -D_FORTIFY_SOURCE=2 -mtune=generic -O2 -pipe -DSYS_BASHRC='/etc/bash/bashrc' -g -Wno-parentheses -Wno-format-security
uname output: Linux laserbook 4.20.12_1 #1 SMP PREEMPT Sat Feb 23 15:05:07 UTC 2019 x86_64 GNU/Linux
Machine Type: x86_64-unknown-linux-gnu

Bash Version: 5.0
Patch Level: 2
Release Status: release

Description:
	When using `IFS= read -r -d '' input` to read null-delimited
	strings on a system with bash 5.0+ and a UTF-8 locale, you can
	encounter a situation where one of the strings being read ends
	in a byte in the range \xC2-\xFD (inclusive) and the next string
	is empty. Something like this: "...\xC2\0\0...."

	We would expect `read` to read this string up to the \0 right
	after \xC2, so that the next `read` gets an empty string (from
	the first \0 to the second \0) and the third reads the rest of
	the string, past the second \0.

	Turns out this isn't the case. In reality, the first `read`
	loads the expected part of the string, but the second one
	actually reads the rest of the string, not the expected empty
	substring.

Repeat-By:
	# Reproduces the bug on Bash 5.0+ with LANG set to a
	# UTF-8 locale (en_US.UTF-8)

	# First, let's make a function that translates a
	# null-delimited list into a semicolon-delimited list
	ntc() {
		while IFS= read -r -d '' input; do
			# Pass the data as an argument, not as the format string
			printf '%s;' "$input"
		done
	}

	# It works in the general case:
	printf "a\0b\0c\0d\0" | ntc | xxd

	# But when some element of a list ends in a byte from 0xC2 to 0xFD
	# and the next element is empty, we end up with the empty
	# element being lost
	printf "\xc2\0\0\0\0" | ntc | xxd

	# But setting LANG='C' makes the issue go away
	printf "\xc2\0\0\0\0" | LANG='C' ntc | xxd

	# Also, bytes outside of the C2-FD range work fine
	printf "\xc1\0\0\0\0" | ntc | xxd
	printf "\xfe\0\0\0\0" | ntc | xxd
	printf "\x9f\0\0\0\0" | ntc | xxd
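
A record count may make the lost element easier to spot than the xxd dumps. The following is only a sketch, not part of the original reproduction: `count_records` is a hypothetical helper name, and using `LC_ALL=C` for the comparison run is an assumption (it overrides `LANG` even when `LC_ALL` is already set in the environment). The input holds four NUL-terminated records, so a `read` that splits on every \0 should report 4 in both runs.

```shell
#!/usr/bin/env bash
# count_records: hypothetical helper, not from the original report.
# Prints how many NUL-terminated records `read` returns from stdin.
count_records() {
	local input n=0
	while IFS= read -r -d '' input; do
		n=$((n + 1))
	done
	printf '%s\n' "$n"
}

# The input contains four records: "\xc2", "", "", "".
# Run under the current locale (a UTF-8 one to trigger the report's case).
printf '\xc2\0\0\0\0' | count_records

# Same input read in the C locale, for comparison.
printf '\xc2\0\0\0\0' | LC_ALL=C count_records
```

If the first number is lower than the second, empty elements are being dropped in the current locale.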