On 9/8/25 2:24 AM, Grisha Levit wrote:
On Sun, Sep 7, 2025 at 2:46 AM Duncan Roe wrote:
`ls -1 [0-5]*` should produce the same output as `ls -1` but instead:-
[...]
superscripts ¹, ² & ³ are missing.
My take at an explanation: '₀' - '₉' are Unicode U+2080-9. These display fine. '⁰' is U+2070 & '⁹' is U+2079, but '¹' is U+00B9, '²' is U+00B2 & '³' is U+00B3.
This appears to be a bug with the globasciiranges option. The documentation suggests that enabling this option will disable locale-aware collation in range expressions:
Yes, that's the idea. The range depends on codepoints rather than locale-specific collating sequences. See https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html for "rational range interpretation."
globasciiranges
If set, range expressions used in pattern matching bracket
expressions (see Pattern Matching above) behave as if in
the traditional C locale when performing comparisons. That
is, pattern matching does not take the current locale’s
collating sequence into account, so b will not collate
between A and B, and upper‐case and lower‐case ASCII
characters will collate together.
But the implementing code [1] for multibyte locales does the following:
385 charcmp_wc (wint_t c1, wint_t c2, int forcecoll)
...
393 if (forcecoll == 0 && glob_asciirange && c1 <= UCHAR_MAX && c2 <= UCHAR_MAX)
394 return ((int)(c1 - c2));
...
399 return (wcscoll (s1, s2));
So, in fact, locale-aware collation is disabled only if the range start
and end codepoints are both in the range U+0001..U+00FF. This doesn't
make much sense for codepoints in the range U+0080..U+00FF.
Maybe not common, but it's perfectly valid.
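To make the quoted condition concrete, here is a minimal sketch of which comparison path a given pair of characters would take. The helper name uses_codepoint_compare is hypothetical and is not part of bash; it only restates the test from the excerpt, and it takes glob_asciirange as a parameter rather than reading a global.

```c
#include <limits.h>
#include <wchar.h>

/* Hypothetical helper, not bash code: returns 1 when the quoted condition
   would take the codepoint-difference path, and 0 when the comparison
   would fall through to wcscoll(). */
static int
uses_codepoint_compare (wint_t c1, wint_t c2, int forcecoll, int glob_asciirange)
{
  return forcecoll == 0 && glob_asciirange
         && c1 <= UCHAR_MAX && c2 <= UCHAR_MAX;
}
```

Under this sketch, a pair such as U+00B9 and U+00B3 takes the codepoint path, while any pair involving U+2070 or U+2079 does not, even with globasciiranges set.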
We should either:
* Remove the <= UCHAR_MAX checks (which would make the behavior match
the documentation)
This is the right fix.
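For illustration only, a rough sketch of what the comparison might look like with the <= UCHAR_MAX tests dropped. This is not the actual patch: charcmp_sketch and in_range_sketch are hypothetical names, glob_asciirange is passed as a parameter instead of being a global, and the sketch returns only the sign of the comparison rather than the difference used in the excerpt.

```c
#include <wchar.h>

/* Hypothetical stand-in for the comparison, not the actual bash change.
   Without the <= UCHAR_MAX tests, any pair of characters is ordered by
   codepoint whenever globasciiranges is on and collation is not forced. */
static int
charcmp_sketch (wint_t c1, wint_t c2, int forcecoll, int glob_asciirange)
{
  wchar_t s1[2] = { (wchar_t) c1, L'\0' };
  wchar_t s2[2] = { (wchar_t) c2, L'\0' };

  if (forcecoll == 0 && glob_asciirange)
    return (c1 > c2) - (c1 < c2);   /* sign only: -1, 0 or 1 */

  return wcscoll (s1, s2);          /* locale-aware path */
}

/* Hypothetical range test built on the sketch: is c inside [lo-hi]
   when globasciiranges is set? */
static int
in_range_sketch (wint_t c, wint_t lo, wint_t hi)
{
  return charcmp_sketch (c, lo, 0, 1) >= 0
         && charcmp_sketch (c, hi, 0, 1) <= 0;
}
```

With this sketch, in_range_sketch (0x2075, 0x2070, 0x2079) is true and in_range_sketch (0x00B9, 0x2070, 0x2079) is false, which is plain codepoint ordering; likewise 'b' does not fall between 'A' and 'B', as the documentation describes.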
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU [email protected] http://tiswww.cwru.edu/~chet/
