IFS delimiter field separation issues

2025-01-08 Jeff Ketchum
I ran into a strange bug using newer versions of bash; I haven't isolated
it to a specific release.

OS1: Oracle Enterprise Linux 9.4, bash 5.1.8(1)
OS2: Gentoo Linux, bash 5.2.37
older bash:
OS3: CentOS Linux 7.9, bash 4.2.46(2)

I am using the Unicode symbol for group separator, U+241D
(https://www.compart.com/en/unicode/U+241D, 0x241D). I set IFS to this
character, and the data contains U+241E and U+241F characters.
When assigning to an array and iterating with for var in "${array[@]}"...
the data ends up being split at unexpected locations.

I don't get this behaviour when the array expansion isn't quoted.

examples quoted and unquoted:
quoted:   exampledatasetBeginning
quoted:   example1�
quoted:   �value 1
quoted:   example2�
quoted:   �value
2�
quoted:   �data2�
quoted:   �secondarydata
quoted:   example3�
quoted:   �value3
unquoted: exampledatasetBeginning
unquoted: example1␞value 1
unquoted: example2␞value
2␟data2␞secondarydata
unquoted: example3␞value3
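
A side note on the output above: the � fragments in the quoted lines suggest that some separators were cut between bytes rather than between characters. The three separators differ only in the last byte of their UTF-8 encoding. A minimal sketch for inspecting the bytes, assuming `od` and `tr` are available. `hex_bytes` is a hypothetical helper name, and the separators are written as explicit bytes here (not as $'\u...') so the result does not depend on the locale:

```shell
#!/usr/bin/env bash
# UTF-8 bytes of U+241D, U+241E and U+241F, spelled out so the values are
# the same in any locale (in a UTF-8 locale they equal $'\u241D' and so on).
GS=$'\xe2\x90\x9d'
RS=$'\xe2\x90\x9e'
US=$'\xe2\x90\x9f'

# hex_bytes is a hypothetical helper: print the bytes of $1 as hex digits.
hex_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

for name in GS RS US; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    printf '%s: %s\n' "$name" "$(hex_bytes "${!name}")"
done
```

This prints e2909d, e2909e and e2909f for GS, RS and US respectively, so all three share the leading bytes e2 90.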


In older versions of bash, this behaves as expected.

I wrote a script that will easily reproduce this:
---

export LC_CTYPE="en_US.UTF-8"

export GS=$'\u241D'
export RS=$'\u241e'
export US=$'\u241f'
export TAB=$'\t'
export NEWLINE=$'\n'

function testscript() {
#create a copy of IFS to revert
local OLD_IFS=${IFS}
#make local so it doesn't affect the global IFS
local IFS=${IFS}
local var1=
local arr1=()
local arr2=()
local entry=
local data="exampledatasetBeginning${GS}example1${RS}value 1${GS}example2${RS}value${NEWLINE}2${US}data2${RS}secondarydata${GS}example3${RS}value${TAB}3"

echo "GS:${GS} RS:${RS} US:${US}"
var1=${data}

# Show the difference
echo "---unquoted echo-"
echo ${var1}
echo "quoted echo--"
echo "${var1}"
echo "-"

#set IFS and assign variable to an array
IFS=${GS}

arr1=( ${var1} )

# Has strange field splitting issues
for entry in "${arr1[@]}"; do
echo "quoted:   ${entry}"
done

# Seems to work as expected
for entry in ${arr1[@]}; do
echo "unquoted: ${entry}"
done

echo "---Quoted array--"

#loop over array to get data sets
arr2=( "${var1}" )

# Behavior is a little different for the
# quoted iteration, it includes extra spaces
for entry in "${arr2[@]}"; do
echo "quoted:   ${entry}"
done

# Behaves the same way as arr1 unquoted
for entry in ${arr2[@]}; do
echo "unquoted: ${entry}"
done
IFS=${OLD_IFS}
}

testscript

---

I noticed a few other differences compared to older bash. For example, in
older bash the quoted array assignment followed by the quoted for loop
provides no separation, as expected.

I didn't see anything specific that might cause this. I did try a few other
Unicode characters, all printable, and they did not show the same
behaviour, but it was not an exhaustive test.

My concern is that I expect quoting to be the best and safest way to keep
the data fields intact: first separated here by the group separator, with
the other separators available later to split the data further, since my
data can potentially contain common separator characters.
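
One way to avoid depending on IFS word splitting for the first-level split is to cut the string with parameter expansion instead. This is a minimal sketch and not part of the reproducer above: `split_on` and `SPLIT_RESULT` are hypothetical names, and the separators are again spelled as explicit UTF-8 bytes so the example behaves the same in any locale.

```shell
#!/usr/bin/env bash
GS=$'\xe2\x90\x9d'   # UTF-8 bytes of U+241D
RS=$'\xe2\x90\x9e'   # UTF-8 bytes of U+241E

# split_on DELIM STRING (hypothetical helper)
# Fills the global array SPLIT_RESULT with the pieces of STRING between
# occurrences of DELIM. IFS is never consulted.
split_on() {
    local delim=$1 rest=$2
    SPLIT_RESULT=()
    # An empty delimiter would never shorten $rest, so refuse it.
    [[ -n $delim ]] || return 1
    while [[ $rest == *"$delim"* ]]; do
        # Text before the first delimiter becomes the next field.
        SPLIT_RESULT+=( "${rest%%"$delim"*}" )
        # Drop that field and the delimiter, then continue with the remainder.
        rest=${rest#*"$delim"}
    done
    SPLIT_RESULT+=( "$rest" )
}

data="exampledatasetBeginning${GS}example1${RS}value 1${GS}example2${RS}value 2"
split_on "$GS" "$data"
printf 'field: %s\n' "${SPLIT_RESULT[@]}"
```

With the sample data this yields three fields, and the U+241E characters stay inside their fields so each one can be split again later with `split_on "$RS" ...`. Empty fields between adjacent delimiters are kept.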

If anybody has thoughts, or even a suggestion for where to begin looking
for the cause, it would be very helpful.

Thanks
Jeff


Re: IFS delimiter field separation issues

2025-01-09 Jeff Ketchum
Excellent! I can stop trying to dig into the code and understand where all
the word expansions happen.

So strange to find those one-off bugs, and great that it was only one.

Do you have, or are you working on, a patch that can be applied to a build?

Thanks
Jeff

On Thu, Jan 9, 2025 at 1:20 PM Chet Ramey wrote:

> On 1/8/25 1:25 PM, Jeff Ketchum wrote:
> > I ran into a strange bug using newer versions of bash; I haven't isolated
> > it to a specific release.
> >
> > OS1: Oracle Enterprise Linux 9.4, bash 5.1.8(1)
> > OS2: Gentoo Linux, bash 5.2.37
> > older bash:
> > OS3: CentOS Linux 7.9, bash 4.2.46(2)
> >
> > I am using the Unicode symbol for group separator, U+241D
> > (https://www.compart.com/en/unicode/U+241D, 0x241D). I set IFS to this
> > character, and the data contains U+241E and U+241F characters.
> > When assigning to an array and iterating with for var in "${array[@]}"...
> > the data ends up being split at unexpected locations.
>
> Thanks for the report. This turned out to be an easy fix: there was one
> place (one!) where word expansion didn't take into account that multibyte
> characters can be protected by bash's internal quoting.
>
> Chet
> --
> ``The lyf so short, the craft so long to lerne.'' - Chaucer
>  ``Ars longa, vita brevis'' - Hippocrates
> Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
>