[long] [x-y] bash range wildcard behaviour with non-ASCII data is very random especially with globasciiranges on

2021-02-07 Thread Stephane Chazelas
Hello,

I was wondering why in my en_GB.UTF-8 locale, [0-9] matched
"only" on 1044 characters in bash 5.1 while in bash 4.4 it used
to match on 1050 different ones.

It turns out it's because since 5.0, the globasciiranges option
is enabled by default. Then I tried to understand what that
option was actually doing, but the more I tested, the less
sense it made, and the whole thing seems quite buggy to me.

The manual says:

DOC> 'globasciiranges'
DOC>  If set, range expressions used in pattern matching bracket
DOC>  expressions (*note Pattern Matching::) behave as if in the
DOC>  traditional C locale when performing comparisons.  That is,
DOC>  the current locale's collating sequence is not taken into
DOC>  account, so 'b' will not collate between 'A' and 'B', and
DOC>  upper-case and lower-case ASCII characters will collate
DOC>  together.
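
The documented claim can be probed directly. The sketch below assumes a bash 4.3 or later on PATH; range_matches is a hypothetical helper introduced here for illustration, not part of bash. It runs the match in a fresh bash so the calling shell's options are left alone.

```shell
# range_matches is a hypothetical helper (not part of bash).
# Usage: range_matches -s|-u SUBJECT PATTERN
#   -s sets globasciiranges, -u unsets it, before running the match.
range_matches() {
  bash -c 'shopt "$1" globasciiranges; [[ $2 = $3 ]]' bash "$1" "$2" "$3"
}

for flag in -s -u; do
  if range_matches "$flag" b '[A-B]'; then
    echo "shopt $flag globasciiranges: b matches [A-B]"
  else
    echo "shopt $flag globasciiranges: b does not match [A-B]"
  fi
done
```

With -s, the manual's wording implies no match; with -u, the result depends on the collation order of the locale the snippet is run in.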

In the C locale, POSIX defines the collation order as being the
same as the order of characters in the ASCII character set, even
if the C locale's charmap is not ASCII, as on EBCDIC systems.
Any other characters in the C locale's charmap (which have to be
single-byte) have to sort after ^?, the last ASCII character. On
all systems I've ever used, in the C locale, the charset was
ASCII and the collation order was based on the byte value of the
encoding (strcoll() is equivalent to strcmp()), even for
characters that are undefined (the ones with encoding 0x80 to
0xff).
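
As a small illustration of byte-value collation in the C locale (using sort(1) rather than bash itself, which is an assumption of this sketch), upper-case ASCII letters all sort before lower-case ones:

```shell
# In the C locale, collation follows byte values, so A (0x41) and
# B (0x42) sort before a (0x61) and b (0x62).
c_sorted=$(printf '%s\n' b A a B | LC_ALL=C sort | tr '\n' ' ')
echo "C locale order: $c_sorted"
```

Running the same pipeline without LC_ALL=C in a locale such as en_GB.UTF-8 typically interleaves the cases instead.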

Yet, the DOC above doesn't reflect what happens in bash in
multibyte locales (the norm these days), as bash still appears
to (sometimes at least) decode sequences of bytes into the
corresponding character in the user's locale, not in the C
locale, and to use the locale's collation order.

I should point out that I've since read:
https://lists.gnu.org/archive/html/bug-bash/2018-08/msg00027.html
https://lists.gnu.org/archive/html/bug-bash/2019-03/msg00145.html
(and https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html)
but those barely scratched the surface.

I had a look at the code to try and understand what was going
on; my findings are below.

The behaviour of [x-y] ranges in wildcards depends on:

- the setting of the globasciiranges option
- whether the locale uses a single-byte charset or not
- in locales with multibyte characters:
  - whether the pattern and subject contain sequences of bytes
    that don't form valid characters
  - whether the pattern and subject contain only single-byte
    characters
  - whether the wide char values of the characters are in the
    0..255 range or not.

(I've not looked at the effect of nocasematch/nocaseglob).
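
To see which of those cases applies on a given system, something like the following can help. It is only a sketch: report_range_context is a hypothetical name, and it reports the option's default in a fresh non-interactive bash plus the charmap of the current locale.

```shell
# report_range_context is a hypothetical helper name.
report_range_context() {
  # Default state of the option in a fresh, non-interactive bash.
  if bash -c 'shopt -q globasciiranges' 2>/dev/null; then
    echo "globasciiranges: on"
  else
    echo "globasciiranges: off (or not supported by this bash)"
  fi
  # Character encoding of the current locale, e.g. UTF-8 or ISO-8859-1.
  echo "charmap: $(locale charmap 2>/dev/null)"
}

report_range_context
```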

== locales with single-byte charset

Let's get those out of the way as they are the simplest:

In single-byte per character locales (which were common until
the late 90s), [x-y] matches on characters whose byte encoding
is (numerically) between that of x and that of y when
globasciiranges is on, as it is by default since 5.0. This holds
whether or not the characters are in the ASCII set and whether
or not the locale's charset is a superset of ASCII; in other
words, it has little to do with ASCII.
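
The rule above amounts to a numeric comparison of byte values. A rough model of it, not bash's implementation, with byte_of and in_byte_range as hypothetical helpers and each argument assumed to be exactly one byte long:

```shell
# byte_of CHAR: print the decimal value of a single byte.
byte_of() {
  printf '%s' "$1" | LC_ALL=C od -An -tu1 | tr -d ' \n'
}

# in_byte_range C X Y: succeed if byte(X) <= byte(C) <= byte(Y).
in_byte_range() {
  c=$(byte_of "$1") x=$(byte_of "$2") y=$(byte_of "$3")
  [ "$x" -le "$c" ] && [ "$c" -le "$y" ]
}

in_byte_range b a c && echo "b is in [a-c]"
in_byte_range B a c || echo "B is not in [a-c]"
```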

When globasciiranges is off, [x-y] matches on characters c that
collate between x and y, but with an additional check: if c
collates the same as x but has a byte value that is less than
that of x or collates the same as y but has a byte value greater
than that of y, then it won't be included (a good thing IMO).
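
The "collates between" part can be approximated outside bash with sort(1), which uses the locale's collation order. This is only a sketch: collates_between is a hypothetical helper, and it does not reproduce the extra byte-value check for characters that collate equally.

```shell
# collates_between C X Y: succeed if X, C, Y are already in sorted
# order under the current locale's collation (checked with sort -c).
collates_between() {
  printf '%s\n' "$2" "$1" "$3" | sort -c 2>/dev/null
}

if collates_between b A B; then
  echo "b collates between A and B in this locale"
else
  echo "b does not collate between A and B in this locale"
fi
```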

== multi-byte charset locales

=== invalid text

First, independently of ranges, bash pattern matching operates in
two different modes depending on whether the input and pattern
are valid text in the locale or not.

If the pattern or subject contains sequences of bytes that don't
form valid characters in the locale, then the pattern matching
works at byte level.

For instance, in a UTF-8 locale:

[[ $string = [é-é]???? ]]

matches on strings that start with é and are followed by exactly
4 characters as long as $string contains valid UTF-8 text, but
if not, the test becomes: [[ $string = [\xc3\xa9-\xc3\xa9]???? ]]
where [\xc3\xa9-\xc3\xa9] matches on byte 0xc3 or bytes 0xa9
to 0xc3 or byte 0xa9 (so bytes 0xa9 to 0xc3) and ? matches a
single byte (including each byte of each valid multibyte
character). For instance:

$ string=$'áé\x80' bash -c '[[ $string = [é-é]???? ]]' && echo yes
yes

as that's the 0xc3 0xa1 0xc3 0xa9 0x80 byte sequence.
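
To check which mode a given subject would fall into, one can dump its bytes and test whether it is well-formed UTF-8. This is a sketch under assumptions: is_valid_utf8 is a hypothetical helper, it relies on iconv being installed, and it only predicts the mode rather than asking bash.

```shell
# The subject from the example above, built with octal escapes so the
# snippet also runs in shells without $'...' quoting.
string=$(printf '\303\241\303\251\200')   # á, é, then a stray 0x80 byte

# Show the raw bytes.
printf '%s' "$string" | od -An -tx1

# iconv exits non-zero when its input is not well-formed in the
# source encoding.
is_valid_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

if is_valid_utf8 "$string"; then
  echo "valid UTF-8: matched character by character"
else
  echo "invalid UTF-8: matched byte by byte"
fi
```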

[[ é = *$'\xa9' ]] and [[ á = [é-é]*$'\xa1' ]] both match, this
time because the *pattern* is not valid UTF-8.

Or in a Big5-HKSCS locale:

$ LANG=zh_HK.big5hkscs luit
$ bash --norc
bash-5.1$ [[ '*' = [β*] ]] && echo yes
yes
bash-5.1$ [[ $'αwhatever\xff]' = [β*] ]] && echo yes
yes

A pattern meant to match either of two characters also matches
strings of any length, as it becomes a completely different
pattern once applied to a string that contains sequences of bytes
not forming valid characters in the locale.

In that same locale:

bash-5.1$ pat='α\*'
bash-5.1$ [[ 'α*' =

unprintable characters in PS1 different on 5.0.3 vs. 5.0.18?

2021-02-07 Thread n952162

Hi,

I use this string as my prompt:

  $?${boldon}$UCODE\w${boldoff}>

where boldon and boldoff are gotten from "tput smso" and "tput rmso". 
Works fine on my x86_64 boxes at 5.0.18(1) but on my raspberry-pi, at
5.0.3(1), I need to use \[ and \].

Am I not seeing something?





Re: unprintable characters in PS1 different on 5.0.3 vs. 5.0.18?

2021-02-07 Thread Chet Ramey

On 2/7/21 11:51 AM, n952162 wrote:

> Hi,
>
> I use this string as my prompt:
>
>    $?${boldon}$UCODE\w${boldoff}>
>
> where boldon and boldoff are gotten from "tput smso" and "tput rmso".
> Works fine on my x86_64 boxes at 5.0.18(1) but on my raspberry-pi, at
> 5.0.3(1), I need to use \[ and \].


You always need to use \[ and \] to delimit sequences of non-printable
characters. If it worked without them, you were lucky.
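
A minimal sketch of that advice applied to the prompt from the original message. It assumes tput is available and that UCODE is set elsewhere, as in the original:

```shell
# tput may fail without a usable terminal; fall back to empty strings.
boldon=$(tput smso 2>/dev/null) || boldon=''
boldoff=$(tput rmso 2>/dev/null) || boldoff=''

# \[ ... \] brackets each non-printing sequence. Single quotes defer the
# expansion of $?, $UCODE, ${boldon} and ${boldoff} to prompt time.
PS1='$?\[${boldon}\]$UCODE\w\[${boldoff}\]>'
```

The markers tell readline which parts of the prompt take up no space on screen, so that it can compute the prompt width correctly.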


--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/



Re: unprintable characters in PS1 different on 5.0.3 vs. 5.0.18?

2021-02-07 Thread n952162

On 2/7/21 6:04 PM, Chet Ramey wrote:

> On 2/7/21 11:51 AM, n952162 wrote:
>
>> Hi,
>>
>> I use this string as my prompt:
>>
>>    $?${boldon}$UCODE\w${boldoff}>
>>
>> where boldon and boldoff are gotten from "tput smso" and "tput rmso".
>> Works fine on my x86_64 boxes at 5.0.18(1) but on my raspberry-pi, at
>> 5.0.3(1), I need to use \[ and \].
>
> You always need to use \[ and \] to delimit sequences of non-printable
> characters. If it worked without them, you were lucky.



ok, thank you.