On 2/25/19 11:17 AM, Olga Ustuzhanina wrote:

> Bash Version: 5.0
> Patch Level: 2
> Release Status: release
>
> Description:
>     When using `IFS= read -r -d '' input` to read null-delimited
>     strings on a system with bash 5.0+ and a UTF-8 locale, you can
>     encounter a situation where one of the strings being read ends in
>     a byte in the range \xC2-\xFD (inclusive) and the next string is
>     empty. Something like this: "...\xC2\0\0...."
>
>     We would expect `read` to read this string up until the \0 right
>     after \xC2, so that the next `read` will get an empty string
>     (from the first \0 to the second \0) and the third will read the
>     rest of the string, past the second \0.
>
>     Turns out this isn't the case. In reality, the first `read`
>     loads the expected part of the string, but the second one
>     actually reads the rest of the string, not the expected empty
>     substring.
>
> Repeat-By:
>     # Reproduces bug on Bash 5.0+ with LANG set to a
>     # UTF-8 locale (en_US.UTF-8)
>
>     # First, let's make a function that translates a
>     # null-delimited list into a semicolon-delimited list
>
>     ntc() {
>         while IFS= read -r -d '' input; do
>             printf "$input;"
>         done
>     }
>
>     # It works in the general case:
>
>     printf "a\0b\0c\0d\0" | ntc | xxd
>
>     # But when some element of a list ends in a byte from 0xC2 to
>     # 0xFD and the next element is empty, we end up with the empty
>     # element being lost
>
>     printf "\xc2\0\0\0\0" | ntc | xxd

This is an invalid multibyte character. The \xc2 is the valid first byte
of a multibyte character, but the next byte read makes the sequence
invalid. The read builtin resynchronizes on the following byte.

There's currently no facility to push back the invalid parts of a
multibyte character. There might be a way to do it if the read is
buffered inside bash, but the `-d' option makes it unbuffered.

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
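
One possible way to sidestep the multibyte handling described above, if the
records can be treated as raw bytes, is to run `read` in the C locale so that
each byte counts as one character and a lone \xc2 is never taken as the start
of a longer sequence. The following is only a sketch of that idea and is not
something proposed in this thread; the function name `ntc_bytes` is
hypothetical, and `od` stands in for `xxd` to avoid depending on vim's tools.

```shell
#!/usr/bin/env bash
# Sketch of a workaround (an assumption, not from the original thread):
# read NUL-delimited records with LC_ALL=C so read treats input as bytes.

# ntc_bytes: hypothetical byte-oriented variant of the reporter's ntc().
ntc_bytes() {
    local input
    # LC_ALL=C applies only to each read invocation; the rest of the
    # shell keeps its locale.
    while LC_ALL=C IFS= read -r -d '' input; do
        # Use a fixed format string so '%' or '\' in the data is not
        # interpreted by printf.
        printf '%s;' "$input"
    done
}

# One record ending in \xc2 followed by three empty records.
# Expected bytes: c2 3b 3b 3b 3b (four records, three of them empty).
printf '\xc2\0\0\0\0' | ntc_bytes | od -An -tx1
```

The bytes of each record are passed through unchanged, so valid UTF-8 data
survives intact; the caveat is that anything done to `$input` while the C
locale is in effect would operate on bytes rather than characters.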