Re: Some byte combinations affect UTF-8 string reading

Chet Ramey Mon, 25 Feb 2019 11:32:49 -0800

On 2/25/19 11:17 AM, Olga Ustuzhanina wrote:

> Bash Version: 5.0
> Patch Level: 2
> Release Status: release
> 
> Description:
>       When using `IFS= read -r -d '' input` to read null-delimited
>       strings on a system with bash 5.0+ and a UTF-8 locale, you can
>       hit a case where one of the strings being read ends in a byte
>       in the range \xC2-\xFD (inclusive) and the next string is
>       empty. Something like this: "...\xC2\0\0...."
> 
>       We would expect `read` to read this string up to the \0 right
>       after \xC2, so that the next `read` gets an empty string
>       (from the first \0 to the second \0) and the third reads the
>       rest of the string, past the second \0.
> 
>       It turns out this isn't the case. In reality, the first `read`
>       loads the expected part of the string, but the second one
>       actually reads the rest of the string, not the expected
>       empty substring.
> 
> Repeat-By:
>       # Reproduces bug on Bash 5.0+ with LANG set to a
>       # UTF-8 locale (en_US.UTF-8)
> 
>       # First, let's make a function that translates a
>       # null-delimited list into a semicolon-delimited list
> 
>       ntc() {
>               while IFS= read -r -d '' input; do
>                       printf '%s;' "$input"
>               done
>       }
> 
>       # It works in the general case:
> 
>       printf "a\0b\0c\0d\0" | ntc | xxd
> 
>       # But when some element of the list ends in a byte from 0xC2 to
>       # 0xFD and the next element is empty, the empty element is
>       # lost:
> 
>       printf "\xc2\0\0\0\0" | ntc | xxd


This is an invalid multibyte character. The \xc2 is the valid first byte
of a multibyte character, but the next byte read makes the sequence
invalid. The read builtin resynchronizes on the following byte. There's
currently no facility to push back the invalid parts of a multibyte
character. There might be a way to do it if the read is buffered inside
bash, but the `-d' option makes it unbuffered.
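
A possible workaround, sketched under the assumption that the records can
be handled as raw bytes: run `read` in the C locale, so that no byte is
treated as the start of a multibyte character and there is nothing to
resynchronize. The function name `ntc_bytes` is hypothetical, and the
per-command `LC_ALL=C` prefix is an assumption about how the reading loop
could be written, not a change in bash itself.

```shell
#!/usr/bin/env bash
# ntc_bytes: hypothetical byte-oriented variant of the ntc function
# from the report. LC_ALL=C applies only to each read call (assumption:
# the data does not need to be split on character boundaries).
ntc_bytes() {
        local input
        while IFS= LC_ALL=C read -r -d '' input; do
                # Fixed format string, so the data is never interpreted
                # as printf directives.
                printf '%s;' "$input"
        done
}

# Same input as the failing case in the report.
printf '\xc2\0\0\0\0' | ntc_bytes | od -An -tx1
```

With the locale forced to C for the read, each record, including the
empty ones after the \xc2, should be followed by its own `;`.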

Chet


-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
