Some byte combinations affect UTF-8 string reading

Olga Ustuzhanina Mon, 25 Feb 2019 08:47:44 -0800

Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: cc
Compilation CFLAGS: -fstack-clash-protection -D_FORTIFY_SOURCE=2
-mtune=generic -O2 -pipe  -DSYS_BASHRC='/etc/bash/bashrc' -g
-Wno-parentheses -Wno-format-security
uname output: Linux laserbook 4.20.12_1 #1 SMP PREEMPT Sat Feb 23
15:05:07 UTC 2019 x86_64 GNU/Linux
Machine Type: x86_64-unknown-linux-gnu


Bash Version: 5.0
Patch Level: 2
Release Status: release

Description:
        When using `IFS= read -r -d '' input` to read null-delimited
        strings on a system with bash 5.0+ and a UTF-8 locale, you can
        hit a case where one of the strings being read ends in a
        byte in the range \xC2-\xFD (inclusive) and the next string is
        empty. Something like this: "...\xC2\0\0...."

        We would expect the first `read` to read up to the \0 right
        after \xC2, the second `read` to return an empty string
        (from the first \0 to the second \0), and the third to read
        the rest of the string, past the second \0.

        It turns out this isn't the case. In reality, the first `read`
        returns the expected part of the string, but the second one
        reads the rest of the string, not the expected empty
        substring.

Repeat-By:
        # Reproduces bug on Bash 5.0+ with LANG set to a
        # UTF-8 locale (en_US.UTF-8)

        # First, let's make a function that translates a
        # null-delimited list into a semicolon-delimited list

        ntc() {
                while IFS= read -r -d '' input; do
                        printf '%s;' "$input"
                done
        }

        # It works in the general case:

        printf "a\0b\0c\0d\0" | ntc | xxd

        # But when some element of the list ends in a byte from 0xC2 to
        # 0xFD and the next element is empty, the empty element is
        # lost

        printf "\xc2\0\0\0\0" | ntc | xxd

        # However, setting LANG='C' makes the issue go away

        printf "\xc2\0\0\0\0" | LANG='C' ntc | xxd

        # Also, bytes outside of the C2-FD range work fine

        printf "\xc1\0\0\0\0" | ntc | xxd
        printf "\xfe\0\0\0\0" | ntc | xxd
        printf "\x9f\0\0\0\0" | ntc | xxd
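
        # The checks above can also be condensed into a field count, as
        # a rough sketch. count_fields is a hypothetical helper, not part
        # of bash: it starts a fresh bash under the locale given as its
        # argument and counts the NUL-terminated records on stdin. It
        # assumes the en_US.UTF-8 locale is installed. The input has four
        # records, so the C locale should report 4; a smaller number
        # under the UTF-8 locale would indicate the lost empty element.

```shell
# count_fields LOCALE: count NUL-terminated records read from stdin by a
# fresh bash started with LC_ALL=LOCALE (hypothetical helper).
count_fields() {
        LC_ALL="$1" bash -c '
                n=0
                while IFS= read -r -d "" field; do
                        n=$((n + 1))
                done
                printf "%s\n" "$n"
        '
}

# \302 is the octal spelling of byte 0xC2
printf '\302\0\0\0\0' | count_fields C
printf '\302\0\0\0\0' | count_fields en_US.UTF-8
```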
