On Sun, Sep 7, 2025 at 2:46 AM Duncan Roe wrote:
> `ls -1 [0-5]*` should produce the same output as `ls -1` but instead:-
[...]
> superscripts ¹, ² & ³ are missing.
>
> My take at an explanation: '₀' - '₉' are Unicode U+2080-9. These display fine.
> '⁰' is U+2070 & '⁹' is U+2079, but '¹' is U+00B9, '²' is U+00B2 & '³' is
> U+00B3.
This appears to be a bug with the globasciiranges option.
The documentation suggests that enabling this option will disable
locale-aware collation in range expressions:
    globasciiranges
            If set, range expressions used in pattern matching bracket
            expressions (see Pattern Matching above) behave as if in
            the traditional C locale when performing comparisons. That
            is, pattern matching does not take the current locale’s
            collating sequence into account, so b will not collate
            between A and B, and upper‐case and lower‐case ASCII
            characters will collate together.
But the implementing code [1] for multibyte locales does the following:
    385 charcmp_wc (wint_t c1, wint_t c2, int forcecoll)
    ...
    393   if (forcecoll == 0 && glob_asciirange && c1 <= UCHAR_MAX && c2 <= UCHAR_MAX)
    394     return ((int)(c1 - c2));
    ...
    399   return (wcscoll (s1, s2));
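So, in fact, locale-aware collation is disabled only if the range start
and end codepoints are both in the range U+0001..U+00FF. This doesn't
make much sense for codepoints in the range U+0080..U+00FF, which are
not ASCII. It would also explain the report above: '¹', '²' and '³' are
the only superscript digits below U+0100, so they are the only ones that
can take the codepoint-comparison path instead of wcscoll().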
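
To make the effect concrete, here is a minimal C sketch of the test on
line 393. It assumes forcecoll == 0 and globasciiranges enabled, and it
assumes that the character being matched is compared against each end of
the range in turn. The helper names uses_codepoint_compare and
in_codepoint_range are hypothetical and do not exist in bash.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical helper, not part of bash: mirrors the two UCHAR_MAX tests
   on line 393 (forcecoll == 0 and globasciiranges are assumed). */
bool uses_codepoint_compare(wint_t c1, wint_t c2)
{
    return c1 <= UCHAR_MAX && c2 <= UCHAR_MAX;
}

/* Hypothetical helper: range membership when both comparisons take the
   plain codepoint path (c1 - c2) instead of wcscoll(). */
bool in_codepoint_range(wint_t c, wint_t lo, wint_t hi)
{
    assert(lo <= hi);   /* a bracket range is expected to be ordered */
    return c >= lo && c <= hi;
}
```

For the range [0-5] (U+0030..U+0035), uses_codepoint_compare(0x0030,
0x00B9) is true for '¹', and in_codepoint_range(0x00B9, 0x0030, 0x0035)
is false, so '¹' is rejected. For '⁰' (U+2070) the first helper returns
false, so the outcome is left to wcscoll() and the current locale.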
We should either:
  * Remove the <= UCHAR_MAX checks (which would make the behavior match
    the documentation)
  * Replace the <= UCHAR_MAX checks with <= 0x7f checks (and update the
    documentation to note that C locale-style comparisons are done only
    if both ends of the range are ASCII characters)
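
The second option could look roughly like the sketch below. It is a
standalone approximation and not the actual bash source: glob_asciirange
is declared locally as a stand-in for the shell option, and both_ascii
and charcmp_wc_sketch are hypothetical names. The first option would
amount to dropping the both_ascii() test altogether.

```c
#include <stdbool.h>
#include <wchar.h>

/* Local stand-in for the globasciiranges setting (assumed enabled). */
int glob_asciirange = 1;

/* Hypothetical helper: true only when both codepoints are ASCII. */
bool both_ascii(wint_t c1, wint_t c2)
{
    return c1 <= 0x7f && c2 <= 0x7f;
}

/* Sketch of the comparison with the UCHAR_MAX tests replaced by 0x7f. */
int charcmp_wc_sketch(wint_t c1, wint_t c2, int forcecoll)
{
    wchar_t s1[2] = { (wchar_t)c1, L'\0' };
    wchar_t s2[2] = { (wchar_t)c2, L'\0' };

    if (c1 == c2)
        return 0;

    /* C-locale style comparison only when both characters are ASCII. */
    if (forcecoll == 0 && glob_asciirange && both_ascii(c1, c2))
        return (int)c1 - (int)c2;

    /* Everything else, including U+0080..U+00FF, goes through the locale. */
    return wcscoll(s1, s2);
}
```

With this variant, a comparison involving '¹' (U+00B9) would no longer
take the codepoint path and would be handled by wcscoll() like the other
superscript digits.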
[1] https://cgit.git.savannah.gnu.org/cgit/bash.git/tree/lib/glob/smatch.c?h=bash-5.3#n385