Hello, I was wondering why in my en_GB.UTF-8 locale, [0-9] matched "only" on 1044 characters in bash 5.1 while in bash 4.4 it used to match on 1050 different ones.
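For the record, here is roughly how such a count can be reproduced (a quick sketch; it iterates over the whole Unicode range using printf's \U escape, so it assumes a system like GNU where wchar_t values are the Unicode code points, and it takes a while):

n=0
for ((cp = 1; cp <= 0x10ffff; cp++)); do
  ((cp >= 0xd800 && cp <= 0xdfff)) && continue  # skip surrogates, printf can't encode them
  printf -v esc '\\U%08x' "$cp"                 # build a \UXXXXXXXX escape sequence...
  printf -v c "$esc"                            # ...and expand it to the corresponding character
  [[ $c = [0-9] ]] && n=$((n + 1))
done
echo "$n"

That prints 1044 with bash 5.1's defaults and 1050 after shopt -u globasciiranges, matching the figures above.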
It turns out it's because since 5.0, the globasciiranges option is enabled by default. Then I tried to understand what that option was actually doing, but the more I tested, the less sense it made, and the whole thing seems quite buggy to me.

The manual says:

DOC> 'globasciiranges'
DOC>      If set, range expressions used in pattern matching bracket
DOC>      expressions (*note Pattern Matching::) behave as if in the
DOC>      traditional C locale when performing comparisons.  That is,
DOC>      the current locale's collating sequence is not taken into
DOC>      account, so 'b' will not collate between 'A' and 'B', and
DOC>      upper-case and lower-case ASCII characters will collate
DOC>      together.

In the C locale, POSIX defines the collation order as being the same as the order of characters in the ASCII character set, even if the C locale's charmap is not ASCII, like on EBCDIC systems (and all other characters, if any, in the C locale's charmap (which have to be single-byte) have to sort after ^?, the last ASCII character).

On all systems I've ever used, in the C locale, the charset was ASCII, and the collation order was based on the byte value of the encoding (strcoll() is equivalent to strcmp()), even for characters that are undefined (the ones with encodings 0x80 to 0xff).

Yet, the DOC above doesn't reflect what happens in bash in multibyte locales (the norm these days), as bash still appears to (sometimes at least) decode sequences of bytes into the corresponding character in the user's locale, not in the C locale, and to use the locale's collation order.

I should point out that I've since read:

https://lists.gnu.org/archive/html/bug-bash/2018-08/msg00027.html
https://lists.gnu.org/archive/html/bug-bash/2019-03/msg00145.html

(and https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html) but those barely scratched the surface.

I had a look at the code to try and understand what was going on, and here are my findings. The behaviour of [x-y] ranges in wildcards depends on:

- the setting of the globasciiranges option
- whether the locale uses a single-byte charset or not
- in locales with multibyte characters:
  - whether the pattern and subject contain sequences of bytes that
    don't form valid characters
  - whether the pattern and subject contain only single-byte characters
  - whether the wide char values of the characters are in the 0..255
    range or not

(I've not looked at the effect of nocasematch/nocaseglob.)

== locales with single-byte charset

Let's take those out of the way as they are the simplest.

In single-byte-per-character locales (which were common until the late 90s), [x-y] matches on characters whose byte encoding is (numerically) between that of x and that of y when globasciiranges is on, as it is by default since 5.0. That holds whether the characters are in the ASCII set or not, and whether the locale's charset is a superset of ASCII or not; in other words, it has little to do with ASCII.

When globasciiranges is off, [x-y] matches on characters c that collate between x and y, but with an additional check: if c collates the same as x but has a byte value less than that of x, or collates the same as y but has a byte value greater than that of y, then it is not included (a good thing IMO).
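To illustrate the globasciiranges-on behaviour (a sketch, assuming a GNU system with an en_US.ISO8859-1 locale installed; in that charset, 0xf6, 0xf7 and 0xf9 are the encodings of ö, ÷ and ù):

$ export LC_ALL=en_US.ISO8859-1
$ shopt -s globasciiranges
$ [[ $'\xf7' = [$'\xf6'-$'\xf9'] ]] && echo yes
yes

That is, ÷ is matched by [ö-ù] purely because of its byte value, even though none of those characters are in ASCII.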
== multi-byte charset locales

=== invalid text

First, independently of ranges, bash pattern matching operates in two different modes depending on whether the input and pattern are valid text in the locale or not.

If the pattern or subject contains sequences of bytes that don't form valid characters in the locale, then the pattern matching works at byte level.

For instance, in a UTF-8 locale,

[[ $string = [é-é]???? ]]

matches on strings that start with é and are followed by exactly 4 characters as long as $string contains valid UTF-8 text, but if not, the test becomes:

[[ $string = [\xc3\xa9-\xc3\xa9]???? ]]

where [\xc3\xa9-\xc3\xa9] matches on byte 0xc3, or bytes 0xa9 to 0xc3, or byte 0xa9 (so bytes 0xa9 to 0xc3), and ? matches a single byte (including each byte of each valid multibyte character). For instance:

$ string=$'áé\x80' bash -c '[[ $string = [é-é]???? ]]' && echo yes
yes

as that's the 0xc3 0xa1 0xc3 0xa9 0x80 byte sequence.

[[ é = *$'\xa9' ]] and [[ á = [é-é]*$'\xa1' ]] both match, this time because the *pattern* is not valid UTF-8.

Or in:

$ LANG=zh_HK.big5hkscs luit
$ bash --norc
bash-5.1$ [[ '*' = [β*] ]] && echo yes
yes
bash-5.1$ [[ $'αwhatever\xff]' = [β*] ]] && echo yes
yes

A pattern meant to match either of two characters also matches strings of any length, as it becomes a completely different pattern once applied to a string that contains sequences of bytes not forming valid characters in the locale (in BIG5-HKSCS, β is encoded as 0xa3 0x5d, and 0x5d is "]", so at byte level the bracket expression is closed early).

In that same locale (where α is encoded as 0xa3 0x5c, 0x5c being backslash):

bash-5.1$ pat='α\*'
bash-5.1$ [[ 'α*' = $pat ]] && echo yes
yes
bash-5.1$ [[ $'αblah\xffblah' = $pat ]] && echo yes
yes

=== valid text

==== globasciiranges option off

If we're in a case where both subject and pattern are valid text, with globasciiranges off (the default before 5.0, after shopt -u globasciiranges in 5.0+), ranges are based on the locale's collation order, like for single-byte locales. That explains for instance why E or É are matched by [a-z] in a typical en_US.UTF-8 locale.

==== globasciiranges option on

That's where it gets a bit random.

First, if neither the pattern nor the subject contains multibyte characters (and the pattern doesn't contain non-standard character classes!?), we're back to the byte-wise behaviour above. In theory, that shouldn't be a valid optimisation, as it's not guaranteed that the wchar_t value of a single-byte character is the same as that byte value (not even for the 0..127 ones, see https://www.quora.com/What-exactly-does-__STDC_MB_MIGHT_NEQ_WC__-stand-for?share=1), but in practice, at least on GNU, FreeBSD and Solaris systems, that seems to hold true for all the locales I've looked at.

Now, for a character c to be matched by [x-y], it also depends on whether the wide char values of c, x and y are less than 256 or not. When comparing c with x (or c with y) to see if it's in the range, if both have wide char values less than 256, the wchar_t values are compared numerically. If not, the collation order is used (with the same additional check as above for characters that collate the same).

What the wide char value is meant to be is not specified. On GNU systems, it's the Unicode code point. For instance, for ¥ (U+00A5, the yen currency symbol), which is encoded as 0xa5 in a en_US.ISO8859-1 locale, 0xc2 0xa5 in en_US.UTF-8, 0xa2 0x44 in zh_TW.Big5 and 0x81 0x30 0x84 0x36 in zh_CN.GB18030 (and 0x5c in the Shift-JIS charset), the wide char value on a GNU system will always be 0xa5. So, there, the characters that have a wide char value < 256 are the ones from the iso8859-1 charset (the Unicode characters U+0001 to U+00FF).
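To illustrate on a GNU system (a sketch; ² is U+00B2 and ٣, ARABIC-INDIC DIGIT THREE, is U+0663, and both collate between 0 and 9 with glibc, at least in the versions I've looked at):

$ export LC_ALL=en_US.UTF-8
$ shopt -s globasciiranges
$ [[ ² = [0-9] ]] && echo yes    # wide char value 0xb2 < 256: numeric comparison, no match
$ [[ ٣ = [0-9] ]] && echo yes    # wide char value 0x663 >= 256: collation order is used
yes
$ shopt -u globasciiranges
$ [[ ² = [0-9] ]] && echo yes    # back to collation order for everything
yes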
On FreeBSD, that wide char value depends on the locale's character encoding. For ¥, that is 0xa5 in en_US.ISO8859-1 and en_US.UTF-8, 0xa244 in zh_TW.Big5, 0x1308436 in zh_CN.GB18030 and 0x5c in ja_JP.SJIS (yes, FreeBSD has a locale with that charset, even though that charset has neither a \ nor a ~ character!). So the list of characters with a wide char value < 256 varies with the locale.

The only thing we can probably reliably count on is that on ASCII-based systems, for ASCII characters, the wchar_t value is the byte value. On GNU, FreeBSD and Solaris, I've found that in UTF-8 locales (the ones that matter these days), it was always the Unicode code point.

That explains why, in a en_US.UTF-8 locale for instance, [0-9] matches on 1050 different characters with globasciiranges off but "only" 1044 when globasciiranges is on. The 6 missing ones are the ones with code points in the 0..255 range that collate between 0 and 9 but have a code point not within 0x30..0x39 (²³¹¼½¾, i.e. U+00B2 U+00B3 U+00B9 U+00BC U+00BD U+00BE).

That also explains why, with globasciiranges on in that same locale, [a-f] no longer matches on EéÉ, as those are in iso8859-1 (code point below 256), but still matches on the ĒĔĖĘĚȄȆȨḔḖḘḚḜẸẺẼẾỀỂỄỆ upper-case characters, and hundreds more with a code point above 255. On FreeBSD/Solaris, in a zh_CN.GB18030 locale, [a-f] would match on éÉ though, as the wchar_t values of those are greater than 255 (the wchar_t value there is not the Unicode code point).

== how about [x-y] in the [[ =~ ]] operator?

In that case, bash calls the system's regex API, so the behaviour you get varies with the system and locale, and there's not much bash can do about it.

On GNU systems, I regularly check what [0-9] matches. It used to match on thousands of characters, like bash wildcards do. Then only a few locales matched characters other than 0123456789. Today (glibc 2.31), I can't find any locale where regexec() or fnmatch() matches any character other than 0123456789. But that seems to be handled as a special case: [0-a] still matches the same 1050 characters as bash wildcards do, plus "a", plus a number of variations on the digit 9 (which sort *after* 9). [a-z] matches on 1367 characters (though no upper-case ones, so roughly half as many as with bash wildcards), though that also seems to be handled as a special case.

Collating elements (and all the surprises they entail) are supported:

$ LC_ALL=hu_HU.UTF-8 bash -c '[[ dzs =~ ^[a-z]$ ]]' && echo yes
yes

There is some support for non-text data. For instance, a standalone 0x80 byte in a UTF-8 locale won't be matched by . nor by [^anything], but is matched by a 0x80 byte in the regexp as long as it's not in a bracket expression ([[ $'\x80' =~ ^$'\x80'$ ]] matches but [[ $'\x80' =~ ^[$'\x80']$ ]] doesn't). It's even possible to match inside a character ([[ é =~ ^$'\xc3' ]] matches).

== When is globasciiranges useful?

AFAICT, globasciiranges only works as expected when using ASCII-only patterns to match exclusively ASCII data on an ASCII system, and mostly so that [a-z] no longer matches upper-case English letters and [A-Z] no longer matches lower-case English letters. But even then, setting LC_CTYPE and LC_COLLATE to C works better, as it does the same but also doesn't match random non-ASCII characters like [a-z] with globasciiranges does.
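To illustrate the difference (a sketch on a GNU system; Ē is one of the characters listed above as still matched by [a-f] with globasciiranges on):

$ LC_ALL=en_US.UTF-8 bash -c 'shopt -s globasciiranges; [[ Ē = [a-f] ]]' && echo yes
yes
$ LC_CTYPE=C LC_COLLATE=C bash -c '[[ Ē = [a-f] ]]' && echo yes
$ LC_CTYPE=C LC_COLLATE=C bash -c '[[ E = [a-z] ]]' && echo yes

Neither of the last two prints anything: with LC_CTYPE=C, the 0xc4 0x93 bytes of Ē don't form a valid character, so matching falls back to byte level and neither byte is in [a-f].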
== What could be done to sort out this mess?

Here bash has two modes: with or without globasciiranges. It feels to me that both modes could be improved so they match users' expectations better. Here's an approach I propose to start the discussion; please let me know what you think.

It would make sense to me for globasciiranges to work like in gawk or zsh (making it more of a globcodepointranges), and for non-globasciiranges to work like those GNU regexps described above, except for the handling of bytes not forming part of characters (and I for one can still live without collating-element matching support). ? and [^x]/[!x] should match such bytes, and * should match across them, so that *.txt matches on all file names that end in .txt even if the leading part is not valid text in the current locale (like it does atm, but for the wrong reasons). It's important because not doing so can allow one to fool sanity checks such as:

case $input in
  ("" | *[!0123456789]*) die "need a decimal integer";;
esac

Several languages have gone for python3's approach of considering bytes in file names that don't form part of valid characters as characters with wchar_t values in the range 0xdc80..0xdcff (code points not mapped to characters in Unicode, as they are used for the UTF-16 surrogate pairs). See:

https://www.python.org/dev/peps/pep-0383/
http://www.unicode.org/reports/tr36/#EnablingLosslessConversion

That's what zsh does since 2015 (https://www.zsh.org/mla/workers/2015/msg02338.html), though that approach is only fully valid in locales where wchar_t values are the Unicode code points.

With that approach, *.txt still matches on $'α-foo-\x80.txt', but not because it switches to byte-wise matching: because that \x80 byte is considered as a character. That file is matched by ?-*-?.txt in zsh, but not in bash (where it matches ??-*-?.txt instead).

Once you do the decoding that way, ranges are just a matter of comparing wchar_t values with globasciiranges, and of using wcscoll() without, with the caveat that the behaviour of wcscoll() for those 0xdc80..0xdcff values is likely undefined¹, so they would need to be special-cased.

So, with globasciiranges off, c (with wchar_t value wc) would be matched by [x-y] (with wchar_t values wx, wy) if:

1. if both x and y are within L'0'..L'9', or within 0xdc80..0xdcff:
   numeric comparison (wc >= wx && wc <= wy)
2. otherwise, collation order, but:
3. we keep the "resort to wchar_t value comparison when two characters
   collate the same", and:
4. special case: if both x and y are lower-case letters, c is not
   matched if it's an upper-case letter, and vice versa (like in those
   GNU regexps above or in ksh93).

I still feel globasciiranges should be the default. English is not my native language, but even then, I've yet to find a use for that collation-based range matching. More often than not, it just gets in the way. It may be nice that [a-z] matches on é, but it only matches on its precomposed form (U+00E9, not U+0065 U+0301), and it doesn't match on ź. What it matches varies from system to system, with the version of Unicode, etc.

Matching on Unicode code points makes sense because the Unicode order is well defined, the same regardless of the system, and independent of the version of Unicode. For most alphabets, the order matches the Unicode range. All of [a-z], [A-Z], [0-9a-fA-F] and [0-9], which represent 99.9% of what users use ranges for, work. [α-ω] also works as expected. One can match on Unicode pages, like hangul_jamo=$'[\u1100-\u11ff]'. And here, with the dcXX handling, you can also filter out or detect bytes not forming valid characters in UTF-8 easily, with ${var//[$'\x80'-$'\xff']/}.
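zsh already behaves that way; for instance, on a GNU system in a UTF-8 locale (a sketch):

$ zsh <<'EOF'
v=$'foo\x80bar'
[[ $v = *[$'\x80'-$'\xff']* ]] && echo 'contains bytes not forming valid characters'
print -r -- "${v//[$'\x80'-$'\xff']/}"
EOF
contains bytes not forming valid characters
foobar

The 0x80 byte is treated as the single "character" 0xdc80, so the [$'\x80'-$'\xff'] range (0xdc80..0xdcff) matches it and the substitution removes it.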
Now looking at the drawbacks.

For globasciiranges, there is the question of [x-y] when x is a character but y is not (i.e. y is a byte not forming part of a valid character), and there's the question of bytes not forming part of valid characters being matched by [$'\u1'-$'\U10FFFF'] or [$'\ud7ff'-$'\ue000'].

The C locale should be the one where we can expect to be able to work with byte values, as in [[ $'\x85' = [$'\x50'-$'\xaf'] ]], and on some systems (GNU ones at least), in that locale, mbtowc() on 0x80..0xff returns -1, so those bytes are mapped to 0xdc80..0xdcff. So, if only for that, it's important that we allow ranges with mixed character/non-character ends, and that we allow non-characters to be matched by ranges other than those where both ends are non-characters.

That does mean potential surprises such as:

$ LC_ALL=ru_RU.koi8r zsh -c "[[ $'\xe9' = [$'\x1'-$'\x80'] ]]" && echo yes
yes

(on a GNU system), as $'\xe9' is a valid character in that locale (И, U+0418) but $'\x80' is not (so it becomes 0xdc80). Same for:

$ export LC_ALL=en_US.UTF-8
$ zsh -c "[[ $'\ue9' = [$'\x1'-$'\x80'] ]]" && echo yes
yes

One needs to remember that matching is done based on code points.

The other problem is that outside of locales using the ASCII, ISO-8859-1 or UTF-8 charmaps, the behaviour varies between systems, as not all systems implement wchar_t the same way. For instance, on FreeBSD, in single-byte locales, the wchar_t value is the byte value (even for bytes that don't represent a character in the locale, such as that 0x80 in KOI8-R above), while on GNU systems, the wchar_t value is always the Unicode code point. For example, that KOI8-R И above would be 0xe9 on FreeBSD and 0x418 on GNU.

gawk does not behave exactly like zsh:

$ LC_ALL=fr_FR.iso885915@euro zsh -c "[[ $'\xa4' = [$'\xa3'-$'\xa5'] ]]" && echo yes
$ printf '\xa4' | LC_ALL=fr_FR.iso885915@euro gawk $'/[\xa3-\xa5]/ {print "yes"}'
yes

When the locale's charmap is single-byte, gawk seems to resort to comparing the byte values of the encoding instead of the wchar_t values like in multibyte locales. 0xa4 in iso885915 is € (U+20AC), while 0xa3 is £ (U+00A3) and 0xa5 is ¥ (U+00A5). So for zsh on GNU systems, like in UTF-8 locales, € is not matched by [£-¥] because 0x20ac is not between 0xa3 and 0xa5, but for gawk it is, because the 0xa4 byte is between 0xa3 and 0xa5. On FreeBSD or Solaris, gawk and zsh would agree. Both approaches have advantages and inconveniences.

To match all characters excluding the bytes that don't form part of characters, you need [$'\u1'-$'\ud7ff'$'\ue000'-$'\U10FFFF'] (or [^$'\x80'-$'\xff'], but not in all locales/systems). That is the Unicode range of characters, so it makes sense, but it's a bit of a mouthful.

---

¹ In UTF-8 locales, on a current GNU system, the 0xd800..0xdfff wchar_t values seem to collate the same as all other characters with undefined collation order, and before all other characters. Same on Solaris, except that they even collate the same as the empty string (!), while on FreeBSD, they collate according to their code points, and before the ones with "defined" collation order. So, in effect, with the fallback to wchar_t comparison for characters that collate the same, we would get the same behaviour on GNU, Solaris and FreeBSD if we relied on wcscoll() for those as well.

--
Stephane