Hello, I was wondering why in my en_GB.UTF-8 locale, [0-9] matched "only" on 1044 characters in bash 5.1 while in bash 4.4 it used to match on 1050 different ones.
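For the record, here is roughly how such a count can be reproduced (a quick sketch; it iterates over the whole Unicode range using printf's \U escape, so it assumes a system like GNU where wchar_t values are the Unicode code points, and it takes a while):

n=0
for ((cp = 1; cp <= 0x10ffff; cp++)); do
  ((cp >= 0xd800 && cp <= 0xdfff)) && continue  # skip surrogates, printf can't encode them
  printf -v esc '\\U%08x' "$cp"                 # build a \UXXXXXXXX escape sequence...
  printf -v c "$esc"                            # ...and expand it to the corresponding character
  [[ $c = [0-9] ]] && n=$((n + 1))
done
echo "$n"

That prints 1044 with bash 5.1's defaults and 1050 after shopt -u globasciiranges, matching the figures above.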
It turns out it's because since 5.0, the globasciiranges option is enabled by default. Then I tried to understand what that option was actually doing, but the more I tested, the less sense it made, and the whole thing seems quite buggy to me.

The manual says:

DOC> 'globasciiranges'
DOC>      If set, range expressions used in pattern matching bracket
DOC>      expressions (*note Pattern Matching::) behave as if in the
DOC>      traditional C locale when performing comparisons.  That is,
DOC>      the current locale's collating sequence is not taken into
DOC>      account, so 'b' will not collate between 'A' and 'B', and
DOC>      upper-case and lower-case ASCII characters will collate
DOC>      together.

In the C locale, POSIX defines the collation order as being the same as the order of characters in the ASCII character set, even if the C locale's charmap is not ASCII, like on EBCDIC systems (and all other characters, if any, in the C locale's charmap (which have to be single-byte) have to sort after ^?, the last ASCII character).

On all systems I've ever used, in the C locale, the charset was ASCII, and the collation order was based on the byte value of the encoding (strcoll() is equivalent to strcmp()), even for characters that are undefined (the ones with encodings 0x80 to 0xff).

Yet, the DOC above doesn't reflect what happens in bash in multibyte locales (the norm these days), as bash still appears to (sometimes at least) decode sequences of bytes into the corresponding character in the user's locale, not in the C locale, and to use the locale's collation order.

I should point out that I've since read:

https://lists.gnu.org/archive/html/bug-bash/2018-08/msg00027.html
https://lists.gnu.org/archive/html/bug-bash/2019-03/msg00145.html

(and https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html) but those barely scratched the surface.

I had a look at the code to try and understand what was going on, and here are my findings. The behaviour of [x-y] ranges in wildcards depends on:

- the setting of the globasciiranges option
- whether the locale uses a single-byte charset or not
- in locales with multibyte characters:
  - whether the pattern and subject contain sequences of bytes that
    don't form valid characters
  - whether the pattern and subject contain only single-byte characters
  - whether the wide char values of the characters are in the 0..255
    range or not

(I've not looked at the effect of nocasematch/nocaseglob.)

== locales with single-byte charset

Let's take those out of the way as they are the simplest.

In single-byte-per-character locales (which were common until the late 90s), [x-y] matches on characters whose byte encoding is (numerically) between that of x and that of y when globasciiranges is on, as it is by default since 5.0. That holds whether the characters are in the ASCII set or not, and whether the locale's charset is a superset of ASCII or not; in other words, it has little to do with ASCII.

When globasciiranges is off, [x-y] matches on characters c that collate between x and y, but with an additional check: if c collates the same as x but has a byte value less than that of x, or collates the same as y but has a byte value greater than that of y, then it is not included (a good thing IMO).
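To illustrate the globasciiranges-on behaviour (a sketch, assuming a GNU system with an en_US.ISO8859-1 locale installed; in that charset, 0xf6, 0xf7 and 0xf9 are the encodings of ö, ÷ and ù):

$ export LC_ALL=en_US.ISO8859-1
$ shopt -s globasciiranges
$ [[ $'\xf7' = [$'\xf6'-$'\xf9'] ]] && echo yes
yes

That is, ÷ is matched by [ö-ù] purely because of its byte value, even though none of those characters are in ASCII.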
== multi-byte charset locales

=== invalid text

First, independently of ranges, bash pattern matching operates in two different modes depending on whether the input and pattern are valid text in the locale or not.

If the pattern or subject contains sequences of bytes that don't form valid characters in the locale, then the pattern matching works at byte level.

For instance, in a UTF-8 locale,

[[ $string = [é-é]???? ]]

matches on strings that start with é and are followed by exactly 4 characters as long as $string contains valid UTF-8 text, but if not, the test becomes:

[[ $string = [\xc3\xa9-\xc3\xa9]???? ]]

where [\xc3\xa9-\xc3\xa9] matches on byte 0xc3, or bytes 0xa9 to 0xc3, or byte 0xa9 (so bytes 0xa9 to 0xc3), and ? matches a single byte (including each byte of each valid multibyte character). For instance:

$ string=$'áé\x80' bash -c '[[ $string = [é-é]???? ]]' && echo yes
yes

as that's the 0xc3 0xa1 0xc3 0xa9 0x80 byte sequence.

[[ é = *$'\xa9' ]] and [[ á = [é-é]*$'\xa1' ]] both match, this time because the *pattern* is not valid UTF-8.

Or in:

$ LANG=zh_HK.big5hkscs luit
$ bash --norc
bash-5.1$ [[ '*' = [β*] ]] && echo yes
yes
bash-5.1$ [[ $'αwhatever\xff]' = [β*] ]] && echo yes
yes

A pattern meant to match either of two characters also matches strings of any length, as it becomes a completely different pattern once applied to a string that contains sequences of bytes not forming valid characters in the locale (in BIG5-HKSCS, β is encoded as 0xa3 0x5d, and 0x5d is "]", so at byte level the bracket expression is closed early).

In that same locale (where α is encoded as 0xa3 0x5c, 0x5c being backslash):

bash-5.1$ pat='α\*'
bash-5.1$ [[ 'α*' = $pat ]] && echo yes
yes
bash-5.1$ [[ $'αblah\xffblah' = $pat ]] && echo yes
yes

=== valid text

==== globasciiranges option off

If we're in a case where both subject and pattern are valid text, with globasciiranges off (the default before 5.0, after shopt -u globasciiranges in 5.0+), ranges are based on the locale's collation order, like for single-byte locales. That explains for instance why E or É are matched by [a-z] in a typical en_US.UTF-8 locale.

==== globasciiranges option on

That's where it gets a bit random.

First, if neither the pattern nor the subject contains multibyte characters (and the pattern doesn't contain non-standard character classes!?), we're back to the byte-wise behaviour above. In theory, that shouldn't be a valid optimisation, as it's not guaranteed that the wchar_t value of a single-byte character is the same as that byte value (not even for the 0..127 ones, see https://www.quora.com/What-exactly-does-__STDC_MB_MIGHT_NEQ_WC__-stand-for?share=1), but in practice, at least on GNU, FreeBSD and Solaris systems, that seems to hold true for all the locales I've looked at.

Now, for a character c to be matched by [x-y], it also depends on whether the wide char values of c, x and y are less than 256 or not. When comparing c with x (or c with y) to see if it's in the range, if both have wide char values less than 256, the wchar_t values are compared numerically. If not, the collation order is used (with the same additional check as above for characters that collate the same).

What the wide char value is meant to be is not specified. On GNU systems, it's the Unicode code point. For instance, for ¥ (U+00A5, the yen currency symbol), which is encoded as 0xa5 in a en_US.ISO8859-1 locale, 0xc2 0xa5 in en_US.UTF-8, 0xa2 0x44 in zh_TW.Big5 and 0x81 0x30 0x84 0x36 in zh_CN.GB18030 (and 0x5c in the Shift-JIS charset), the wide char value on a GNU system will always be 0xa5. So, there, the characters that have a wide char value < 256 are the ones from the iso8859-1 charset (the Unicode characters U+0001 to U+00FF).
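To illustrate on a GNU system (a sketch; ² is U+00B2 and ٣, ARABIC-INDIC DIGIT THREE, is U+0663, and both collate between 0 and 9 with glibc, at least in the versions I've looked at):

$ export LC_ALL=en_US.UTF-8
$ shopt -s globasciiranges
$ [[ ² = [0-9] ]] && echo yes    # wide char value 0xb2 < 256: numeric comparison, no match
$ [[ ٣ = [0-9] ]] && echo yes    # wide char value 0x663 >= 256: collation order is used
yes
$ shopt -u globasciiranges
$ [[ ² = [0-9] ]] && echo yes    # back to collation order for everything
yes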
On FreeBSD, that wide char value depends on the locale's character encoding. For ¥, that is 0xa5 in en_US.ISO8859-1 and en_US.UTF-8, 0xa244 in zh_TW.Big5, 0x1308436 in zh_CN.GB18030 and 0x5c in ja_JP.SJIS (yes, FreeBSD has a locale with that charset, even though that charset has neither a \ nor a ~ character!). So the list of characters with a wide char value < 256 varies with the locale.

The only thing we can probably reliably count on is that on ASCII-based systems, for ASCII characters, the wchar_t value is the byte value. On GNU, FreeBSD and Solaris, I've found that in UTF-8 locales (the ones that matter these days), it was always the Unicode code point.

That explains why, in a en_US.UTF-8 locale for instance, [0-9] matches on 1050 different characters with globasciiranges off but "only" 1044 when globasciiranges is on. The 6 missing ones are the ones with code points in the 0..255 range that collate between 0 and 9 but have a code point not within 0x30..0x39 (²³¹¼½¾, i.e. U+00B2 U+00B3 U+00B9 U+00BC U+00BD U+00BE).

That also explains why, with globasciiranges on in that same locale, [a-f] no longer matches on EéÉ, as those are in iso8859-1 (code point below 256), but still matches on the ĒĔĖĘĚȄȆȨḔḖḘḚḜẸẺẼẾỀỂỄỆ upper-case characters, and hundreds more with a code point above 255. On FreeBSD/Solaris, in a zh_CN.GB18030 locale, [a-f] would match on éÉ though, as the wchar_t values of those are greater than 255 (the wchar_t value there is not the Unicode code point).

== how about [x-y] in the [[ =~ ]] operator?

In that case, bash calls the system's regex API, so the behaviour you get varies with the system and locale, and there's not much bash can do about it.

On GNU systems, I regularly check what [0-9] matches. It used to match on thousands of characters, like bash wildcards do. Then only a few locales matched characters other than 0123456789. Today (glibc 2.31), I can't find any locale where regexec() or fnmatch() matches any character other than 0123456789. But that seems to be handled as a special case: [0-a] still matches the same 1050 characters as bash wildcards do, plus "a", plus a number of variations on the digit 9 (which sort *after* 9). [a-z] matches on 1367 characters (though no upper-case ones, so roughly half as many as with bash wildcards), though that also seems to be handled as a special case.

Collating elements (and all the surprises they entail) are supported:

$ LC_ALL=hu_HU.UTF-8 bash -c '[[ dzs =~ ^[a-z]$ ]]' && echo yes
yes

There is some support for non-text data. For instance, a standalone 0x80 byte in a UTF-8 locale won't be matched by . nor by [^anything], but is matched by a 0x80 byte in the regexp as long as it's not in a bracket expression ([[ $'\x80' =~ ^$'\x80'$ ]] matches but [[ $'\x80' =~ ^[$'\x80']$ ]] doesn't). It's even possible to match inside a character ([[ é =~ ^$'\xc3' ]] matches).

== When is globasciiranges useful?

AFAICT, globasciiranges only works as expected when using ASCII-only patterns to match exclusively ASCII data on an ASCII system, and mostly so that [a-z] no longer matches upper-case English letters and [A-Z] no longer matches lower-case English letters. But even then, setting LC_CTYPE and LC_COLLATE to C works better, as it does the same but also doesn't match random non-ASCII characters like [a-z] with globasciiranges does.
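To illustrate the difference (a sketch on a GNU system; Ē is one of the characters listed above as still matched by [a-f] with globasciiranges on):

$ LC_ALL=en_US.UTF-8 bash -c 'shopt -s globasciiranges; [[ Ē = [a-f] ]]' && echo yes
yes
$ LC_CTYPE=C LC_COLLATE=C bash -c '[[ Ē = [a-f] ]]' && echo yes
$ LC_CTYPE=C LC_COLLATE=C bash -c '[[ E = [a-z] ]]' && echo yes

Neither of the last two prints anything: with LC_CTYPE=C, the 0xc4 0x93 bytes of Ē don't form a valid character, so matching falls back to byte level and neither byte is in [a-f].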
== What could be done to sort out this mess?

Here bash has two modes: with or without globasciiranges. It feels to me that both modes could be improved so they match users' expectations better. Here's an approach I propose to start the discussion; please let me know what you think.

It would make sense to me for globasciiranges to work like in gawk or zsh (making it more of a globcodepointranges), and for non-globasciiranges to work like those GNU regexps described above, except for the handling of bytes not forming part of characters (and I for one can still live without collating-element matching support). ? and [^x]/[!x] should match such bytes, and * should match across them, so that *.txt matches on all file names that end in .txt even if the leading part is not valid text in the current locale (like it does atm, but for the wrong reasons). It's important because not doing so can allow one to fool sanity checks such as:

case $input in
  ("" | *[!0123456789]*) die "need a decimal integer";;
esac

Several languages have gone for python3's approach of considering bytes in file names that don't form part of valid characters as characters with wchar_t values in the range 0xdc80..0xdcff (code points not mapped to characters in Unicode, as they are used for the UTF-16 surrogate pairs). See:

https://www.python.org/dev/peps/pep-0383/
http://www.unicode.org/reports/tr36/#EnablingLosslessConversion

That's what zsh does since 2015 (https://www.zsh.org/mla/workers/2015/msg02338.html), though that approach is only fully valid in locales where wchar_t values are the Unicode code points.

With that approach, *.txt still matches on $'α-foo-\x80.txt', but not because it switches to byte-wise matching: because that \x80 byte is considered as a character. That file is matched by ?-*-?.txt in zsh, but not in bash (where it matches ??-*-?.txt instead).

Once you do the decoding that way, ranges are just a matter of comparing wchar_t values with globasciiranges, and of using wcscoll() without, with the caveat that the behaviour of wcscoll() for those 0xdc80..0xdcff values is likely undefined¹, so they would need to be special-cased.

So, with globasciiranges off, c (with wchar_t value wc) would be matched by [x-y] (with wchar_t values wx, wy) if:

1. if both x and y are within L'0'..L'9', or within 0xdc80..0xdcff:
   numeric comparison (wc >= wx && wc <= wy)
2. otherwise, collation order, but:
3. we keep the "resort to wchar_t value comparison when two characters
   collate the same", and:
4. special case: if both x and y are lower-case letters, c is not
   matched if it's an upper-case letter, and vice versa (like in those
   GNU regexps above or in ksh93).

I still feel globasciiranges should be the default. English is not my native language, but even then, I've yet to find a use for that collation-based range matching. More often than not, it just gets in the way. It may be nice that [a-z] matches on é, but it only matches on its precomposed form (U+00E9, not U+0065 U+0301), and it doesn't match on ź. What it matches varies from system to system, with the version of Unicode, etc.

Matching on Unicode code points makes sense because the Unicode order is well defined, the same regardless of the system, and independent of the version of Unicode. For most alphabets, the order matches the Unicode range. All of [a-z], [A-Z], [0-9a-fA-F] and [0-9], which represent 99.9% of what users use ranges for, work. [α-ω] also works as expected. One can match on Unicode pages, like hangul_jamo=$'[\u1100-\u11ff]'. And here, with the dcXX handling, you can also filter out or detect bytes not forming valid characters in UTF-8 easily, with ${var//[$'\x80'-$'\xff']/}.
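zsh already behaves that way; for instance, on a GNU system in a UTF-8 locale (a sketch):

$ zsh <<'EOF'
v=$'foo\x80bar'
[[ $v = *[$'\x80'-$'\xff']* ]] && echo 'contains bytes not forming valid characters'
print -r -- "${v//[$'\x80'-$'\xff']/}"
EOF
contains bytes not forming valid characters
foobar

The 0x80 byte is treated as the single "character" 0xdc80, so the [$'\x80'-$'\xff'] range (0xdc80..0xdcff) matches it and the substitution removes it.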
Now looking at the drawbacks.

For globasciiranges, there is the question of [x-y] when x is a character but y is not (i.e. y is a byte not forming part of a valid character), and there's the question of bytes not forming part of valid characters being matched by [$'\u1'-$'\U10FFFF'] or [$'\ud7ff'-$'\ue000'].

The C locale should be the one where we can expect to be able to work with byte values, as in [[ $'\x85' = [$'\x50'-$'\xaf'] ]], and on some systems (GNU ones at least), in that locale, mbtowc() on 0x80..0xff returns -1, so those bytes are mapped to 0xdc80..0xdcff. So, if only for that, it's important that we allow ranges with mixed character/non-character ends, and that we allow non-characters to be matched by ranges other than those where both ends are non-characters.

That does mean potential surprises such as:

$ LC_ALL=ru_RU.koi8r zsh -c "[[ $'\xe9' = [$'\x1'-$'\x80'] ]]" && echo yes
yes

(on a GNU system), as $'\xe9' is a valid character in that locale (И, U+0418) but $'\x80' is not (so it becomes 0xdc80). Same for:

$ export LC_ALL=en_US.UTF-8
$ zsh -c "[[ $'\ue9' = [$'\x1'-$'\x80'] ]]" && echo yes
yes

One needs to remember that matching is done based on code points.

The other problem is that outside of locales using the ASCII, ISO-8859-1 or UTF-8 charmaps, the behaviour varies between systems, as not all systems implement wchar_t the same way. For instance, on FreeBSD, in single-byte locales, the wchar_t value is the byte value (even for bytes that don't represent a character in the locale, such as that 0x80 in KOI8-R above), while on GNU systems, the wchar_t value is always the Unicode code point. For example, that KOI8-R И above would be 0xe9 on FreeBSD and 0x418 on GNU.

gawk does not behave exactly like zsh:

$ LC_ALL=fr_FR.iso885915@euro zsh -c "[[ $'\xa4' = [$'\xa3'-$'\xa5'] ]]" && echo yes
$ printf '\xa4' | LC_ALL=fr_FR.iso885915@euro gawk $'/[\xa3-\xa5]/ {print "yes"}'
yes

When the locale's charmap is single-byte, gawk seems to resort to comparing the byte values of the encoding instead of the wchar_t values like in multibyte locales. 0xa4 in iso885915 is € (U+20AC), while 0xa3 is £ (U+00A3) and 0xa5 is ¥ (U+00A5). So for zsh on GNU systems, like in UTF-8 locales, € is not matched by [£-¥] because 0x20ac is not between 0xa3 and 0xa5, but for gawk it is, because the 0xa4 byte is between 0xa3 and 0xa5. On FreeBSD or Solaris, gawk and zsh would agree. Both approaches have advantages and inconveniences.

To match all characters excluding the bytes that don't form part of characters, you need [$'\u1'-$'\ud7ff'$'\ue000'-$'\U10FFFF'] (or [^$'\x80'-$'\xff'], but not in all locales/systems). That is the Unicode range of characters, so it makes sense, but it's a bit of a mouthful.

---

¹ In UTF-8 locales, on a current GNU system, the 0xd800..0xdfff wchar_t values seem to collate the same as all other characters with undefined collation order, and before all other characters. Same on Solaris, except that they even collate the same as the empty string (!), while on FreeBSD, they collate according to their code points, and before the ones with "defined" collation order. So, in effect, with the fallback to wchar_t comparison for characters that collate the same, we would get the same behaviour on GNU, Solaris and FreeBSD if we relied on wcscoll() for those as well.

--
Stephane