IFS delimiter field separation issues
I ran into a strange bug using newer versions of bash. I haven't isolated
it to a specific release.

OS1: Oracle Enterprise Linux 9.4, bash 5.1.8(1)
OS2: Gentoo Linux, bash version 5.2.37
older bash:
OS3: CentOS Linux 7.9, bash 4.2.46(2)

In using unicode group separator character U+241D,
https://www.compart.com/en/unicode/U+241D, 0x241D
I set the IFS to this unicode character, and have U+241E and U+241F
characters in the data.
When assigning to an array, and using for var in "${array[@]}"...
it ends up splitting the data at unexpected locations.
I don't get this behaviour when the array isn't quoted.

Examples, quoted and unquoted:

quoted: exampledatasetBeginning
quoted: example1�
quoted: �value 1
quoted: example2�
quoted: �value 2�
quoted: �data2�
quoted: �secondarydata
quoted: example3�
quoted: �value3
unquoted: exampledatasetBeginning
unquoted: example1␞value 1
unquoted: example2␞value 2␟data2␞secondarydata
unquoted: example3␞value3

In older versions of bash, this behaves as expected.

I wrote a script that will easily reproduce this:
---
export LC_CTYPE="en_US.UTF-8"
export GS=$'\u241D'
export RS=$'\u241e'
export US=$'\u241f'
export TAB=$'\t'
export NEWLINE=$'\n'

function testscript() {
    #create a copy of IFS to revert
    local OLD_IFS=${IFS}
    #make local so it doesn't affect global
    local IFS=${IFS}
    local var1=
    local arr1=()
    local arr2=()
    local entry=
    local data="exampledatasetBeginning${GS}example1${RS}value 1${GS}example2${RS}value${NEWLINE}2${US}data2${RS}secondarydata${GS}example3${RS}value${TAB}3"

    echo "GS:${GS} RS:${RS} US:${US}"
    var1=${data}

    # Show the difference
    echo "---unquoted echo-"
    echo ${var1}
    echo "quoted echo--"
    echo "${var1}"
    echo "-"

    #set IFS and assign variable to an array
    IFS=${GS}
    arr1=( ${var1} )

    # Has strange field splitting issues
    for entry in "${arr1[@]}"; do
        echo "quoted: ${entry}"
    done

    # Seems to work as expected
    for entry in ${arr1[@]}; do
        echo "unquoted: ${entry}"
    done

    echo "---Quoted array--"
    #loop over array to get data sets
    arr2=( "${var1}" )

    # Behavior is a little different for the
    # quoted iteration, it includes extra spaces
    for entry in "${arr2[@]}"; do
        echo "quoted: ${entry}"
    done

    # Behaves the same way as arr1 unquoted
    for entry in ${arr2[@]}; do
        echo "unquoted: ${entry}"
    done

    IFS=${OLD_IFS}
}
testscript
---

I noticed a few other differences when compared to older bash, where the
quoted array assignment and quoted for loop provide no separation, as
expected.

I didn't see anything specific that may cause this. I did try testing with
a few other unicode characters, though they were printable characters, and
they did not have the same behaviour, but it was not an exhaustive test.

My concern is that I expect quoting to be the best and safest way to keep
the data fields as I expect them: first separated here by Group Separator,
so that I can later use the other separators to further split the data,
since my data can potentially contain common separator characters.

If anybody has thoughts, or even an idea of where to begin looking for
this, it would be greatly helpful.

Thanks
Jeff
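For anyone who needs to split on a multibyte separator on an affected build, one possible workaround is to avoid IFS altogether and split with parameter expansion. The following is only a sketch under stated assumptions: `split_on` and the global array `fields` are hypothetical names, not something from the report, and it has not been run against the affected bash builds. The separators are written as byte escapes so the example does not depend on how `\u` is handled in the current locale.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the report): split a string on a literal
# delimiter using parameter expansion only, so IFS is never involved.
# The result is stored in the global array "fields" (an assumed name).
split_on() {
    local delim=$1 rest=$2
    fields=()
    # An empty delimiter would never shrink "rest", so refuse it.
    [[ -n $delim ]] || return 1
    while [[ $rest == *"$delim"* ]]; do
        fields+=( "${rest%%"$delim"*}" )   # text before the first delimiter
        rest=${rest#*"$delim"}             # drop that text and the delimiter
    done
    fields+=( "$rest" )                    # whatever follows the last delimiter
}

# UTF-8 byte sequences for U+241D and U+241E.
GS=$'\xe2\x90\x9d'
RS=$'\xe2\x90\x9e'

split_on "$GS" "example1${RS}value 1${GS}example2${RS}value 2"
for entry in "${fields[@]}"; do
    printf 'field: %s\n' "$entry"
done
```

Because the delimiter is quoted inside the patterns, it is matched literally, and each resulting field can be split again on the next separator with the same helper.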
Re: IFS delimiter field separation issues
Excellent! I can stop trying to dig into the code and understand where all
the word expansions happen.
It is so strange to find those one-off bugs, and great that it was only one.

Do you have, or are you working on, a patch that can be applied to a build?

Thanks
Jeff

On Thu, Jan 9, 2025 at 1:20 PM Chet Ramey wrote:

> On 1/8/25 1:25 PM, Jeff Ketchum wrote:
> > I ran into a strange bug using newer versions of bash, I haven't isolated
> > it to a specific release.
> >
> > OS1: Oracle Enterprise linux 9,4 bash 5.1.8(1)
> > OS2: Gentoo linux bash version 5.2.37
> > older bash:
> > OS3: centos linux 7.9 bash 4.2.46(2)
> >
> > In using unicode group separator character U+241D,
> > https://www.compart.com/en/unicode/U+241D, 0x241D
> > I set the IFS to this unicode, and have U+241E and U+241F characters in the
> > data.
> > When assigning to an array, and using for var in "${array[@]}"...
> > it ends up splitting the data at unexpected locations.
>
> Thanks for the report. This turned out to be an easy fix: there was one
> place (one!) where word expansion didn't take into account that multibyte
> characters can be protected by bash's internal quoting.
>
> Chet
> --
> ``The lyf so short, the craft so long to lerne.'' - Chaucer
> ``Ars longa, vita brevis'' - Hippocrates
> Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
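As background for the multibyte aspect, it may help to see how the three separators are encoded. The sketch below computes the UTF-8 bytes of a code point with plain shell arithmetic; `utf8_hex` is a hypothetical helper written for this illustration. It shows that U+241D, U+241E and U+241F differ only in their last byte, which is at least consistent with the stray � characters in the report (a character cut between its bytes), though the actual cause is the one Chet describes above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the UTF-8 encoding of a code point (given as
# hex digits, without the "U+" prefix) as space-separated hex bytes.
utf8_hex() {
    local cp=$(( 16#$1 ))
    if (( cp < 0x80 )); then
        printf '%02x\n' "$cp"
    elif (( cp < 0x800 )); then
        printf '%02x %02x\n' \
            $(( 0xC0 | (cp >> 6) )) \
            $(( 0x80 | (cp & 0x3F) ))
    elif (( cp < 0x10000 )); then
        printf '%02x %02x %02x\n' \
            $(( 0xE0 | (cp >> 12) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    else
        printf '%02x %02x %02x %02x\n' \
            $(( 0xF0 | (cp >> 18) )) \
            $(( 0x80 | ((cp >> 12) & 0x3F) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    fi
}

for cp in 241D 241E 241F; do
    printf 'U+%s: %s\n' "$cp" "$(utf8_hex "$cp")"
done
# U+241D: e2 90 9d
# U+241E: e2 90 9e
# U+241F: e2 90 9f
```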
variable replacement text differences
I noticed a strange issue where the replacement text in a variable
substitution gets replaced in a weird way.
It may be intentional. I just want to understand the differences and
whether it is intended, or whether it's a newer bug.

Originally I was doing this to prepend data to an array, like
"${array_[@]/#/${variable}}"
I simplified it down to just string replacement.

$ cat replacestring.sh
original_string="1|2|3|4"
replace_string=':\\'
echo "original: ${original_string} replace:${replace_string}"
echo "unquoted ${original_string/2/${replace_string}}"
echo "quoted ${original_string/2/"${replace_string}"}"

GNU bash, version 5.2.37(1)-release (x86_64-pc-linux-gnu)
$ bash replacestring.sh
original: 1|2|3|4 replace::\\
unquoted 1|:\|3|4
quoted 1|:\\|3|4

On older versions, this was a little different:
GNU bash, version 4.1.2(2)-release (x86_64-redhat-linux-gnu)
$ bash /tmp/replacestring.sh
original: 1|2|3|4 replace::\\
unquoted 1|:\\|3|4
quoted 1|":\\"|3|4

Then, there was an in-between version: (custom but based on this)
GNU bash, version 5.1.16(1)-release (x86_64-pc-linux-musl)
original: 1|2|3|4 replace::\\
unquoted 1|:\\|3|4
quoted 1|:\\|3|4

In newer versions, the unquoted replacement loses a backslash.
In the old version, the quoted form has the quotes as part of the replacement.

Jeff
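For the original use case (prepending data to every array element), one way to avoid depending on how the replacement string is processed is to skip pattern substitution and build the new array in a loop. This is a sketch only: `prefix_all` and the global array `prefixed` are hypothetical names introduced for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prepend a literal prefix to each remaining argument
# and store the results in the global array "prefixed" (an assumed name).
# Plain concatenation is used, so backslashes, quotes and '&' in the prefix
# are never reinterpreted.
prefix_all() {
    local prefix=$1 item
    shift
    prefixed=()
    for item in "$@"; do
        prefixed+=( "${prefix}${item}" )
    done
}

items=( "1" "2" "3 4" )
prefix_all ':\\' "${items[@]}"
printf '%s\n' "${prefixed[@]}"
# :\\1
# :\\2
# :\\3 4
```

The trade-off is a few more lines than "${array_[@]/#/${variable}}", in exchange for behaviour that should not vary between the bash versions compared above.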
Re: variable replacement text differences
I think that helps me understand the differences better, and what I am
seeing. Though it doesn't seem completely consistent, and it is not what I
expected when using a variable with a specific layout (and it is also a
breaking change enabled by default).

For example, if I change the replacement to '\a' it stays as \a:

$ bash replacestring.sh
original: 1|2|3|4 replace:\a
unquoted 1|\a|3|4
quoted 1|\a|3|4

So it seems it only escapes it if it's a double backslash, or when escaping
a &.

It is different again if I change the script to do \a manually:

$ cat replacestring.sh
original_string="1|2|3|4"
replace_string='\a'
echo "original: ${original_string} replace:${replace_string}"
echo "unquoted ${original_string/2/${replace_string}}"
echo "quoted ${original_string/2/"${replace_string}"}"
echo "manual ${original_string/2/\a}"

output for newer bash:
...
manual 1|a|3|4

This is slightly different behaviour from the variable, but in older
versions, it shows the backslash:

GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
$ bash /tmp/replacestring.sh
original: 1|2|3|4 replace:\a
unquoted 1|\a|3|4
quoted 1|"\a"|3|4
manual 1|\a|3|4

It could also be an escaped double quote, '\"', and it doesn't interpret
the \ as an escape there, even though that is a character I would expect
that to happen for.
Also, a single \ doesn't have to be quoted either:
replace_string='\'

Anyway, this does help, as I can look into turning it off, and understand
the behaviour better.

Jeff

On Thu, Jun 19, 2025 at 4:11 PM Lawrence Velázquez wrote:

> On Thu, Jun 19, 2025, at 5:28 PM, Jeff Ketchum wrote:
> > $ cat replacestring.sh
> > original_string="1|2|3|4"
> > replace_string=':\\'
> > echo "original: ${original_string} replace:${replace_string}"
> > echo "unquoted ${original_string/2/${replace_string}}"
> > echo "quoted ${original_string/2/"${replace_string}"}"
> >
> > [...]
> >
> > on older versions, this was a little different:
> > GNU bash, version 4.1.2(2)-release (x86_64-redhat-linux-gnu)
> > $ bash /tmp/replacestring.sh
> > original: 1|2|3|4 replace::\\
> > unquoted 1|:\\|3|4
> > quoted 1|":\\"|3|4
> >
> >
> > Then, there was an inbetween version: (custom but based on this)
> > GNU bash, version 5.1.16(1)-release (x86_64-pc-linux-musl)
> > original: 1|2|3|4 replace::\\
> > unquoted 1|:\\|3|4
> > quoted 1|:\\|3|4
>
> This changed in bash 4.3:
>
>     There is one incompatible change between bash-4.2 and
>     bash-4.3. Bash now performs quote removal on the replacement
>     string in pattern substitution (${param/pat/rep}), since
>     the shell treats quotes as special. If you have to quote
>     single quotes to get them to be treated literally, the shell
>     should perform quote removal on them.
>
> https://lists.gnu.org/archive/html/bug-bash/2014-02/msg00081.html
>
> > [...]
> >
> > GNU bash, version 5.2.37(1)-release (x86_64-pc-linux-gnu)
> > $ bash replacestring.sh
> > original: 1|2|3|4 replace::\\
> > unquoted 1|:\|3|4
> > quoted 1|:\\|3|4
>
> This changed in bash 5.2:
>
>     There is a new shell option, `patsub_replacement'. When
>     enabled, a `&' in the replacement string of the pattern
>     substitution expansion is replaced by the portion of the
>     string that matched the pattern. Backslash will escape the
>     `&' and insert a literal `&'. This option is enabled by
>     default.
>
> (Since '\' is now an escape character, the fact that it escapes
> itself is implied.)
>
> https://lists.gnu.org/archive/html/bug-bash/2022-09/msg00056.html
>
> --
> vq
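For reference, turning the option off might look like the sketch below. It assumes bash 4.3 or later for the quoted form, and it guards the `shopt` call because `patsub_replacement` only exists from bash 5.2 onward. The variable names follow the script in this thread; `unquoted` and `quoted` are names added here to hold the results.

```shell
#!/usr/bin/env bash
original_string="1|2|3|4"
replace_string=':\\'

# patsub_replacement is only known to bash 5.2 and later. On older shells
# "shopt -q" fails with an error message, which is silenced here, and the
# option is simply left alone.
if shopt -q patsub_replacement 2>/dev/null; then
    shopt -u patsub_replacement
fi

unquoted="${original_string/2/${replace_string}}"
quoted="${original_string/2/"${replace_string}"}"

printf 'unquoted %s\n' "$unquoted"
printf 'quoted   %s\n' "$quoted"
```

With the option disabled, the unquoted form is expected to keep both backslashes, as in the 5.1.16 output quoted above. Quoting the replacement already gave `1|:\\|3|4` on both 5.1.16 and 5.2.37 in the earlier outputs, so it may be the simpler choice if bash 4.2 and older do not need to be supported.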