On 01/09/2025 18:19, Brian Inglis via Cygwin wrote:
> On 2025-08-31 13:06, Mariusz Wodzicki via Cygwin wrote:
>> Description of the problem.
>> [0-9] picks also certain Unicode superscript characters (namely, ⁰ ⁴ ⁵ ⁶
>> ⁷ ⁸ ⁹), and every Unicode subscript character.
>>
>> Example: the directory has the following files:
>> $ /bin/ls
>> ₀.txt ₁.txt ₂.txt ₃.txt ₄.txt ₅.txt ₆.txt ₇.txt ₈.txt ₉.txt
>> ⁰.txt ¹.txt ².txt ³.txt ⁴.txt ⁵.txt ⁶.txt ⁷.txt ⁸.txt ⁹.txt
>>
>> $ /bin/ls [0-9].txt
>> ₀.txt ₁.txt ₃.txt ⁴.txt ⁵.txt ⁶.txt ⁷.txt ⁸.txt
>> ⁰.txt ₂.txt ₄.txt ₅.txt ₆.txt ₇.txt ₈.txt
>>
>> $ locale
>> LANG=en_US.UTF-8
>> LC_CTYPE="en_US.UTF-8"
>> LC_NUMERIC="en_US.UTF-8"
>> LC_TIME="en_US.UTF-8"
>> LC_COLLATE="en_US.UTF-8"
>> LC_MONETARY="en_US.UTF-8"
>> LC_MESSAGES="en_US.UTF-8"
>> LC_ALL=
>>
>> System.
>> Fully up to date Windows 11
>> cygwin 3.6.4-1
>> bash 5.2.21-1
>
> For reproducible results prefix commands with LC_ALL=C … or possibly
> just LC_COLLATE=C or LC_CTYPE=C or =POSIX to standardize the locale,
> otherwise many commands will respect the current locale, and some
> respect Unicode regardless of locale e.g. `info wc`:
>
> "Unless the environment variable ‘POSIXLY_CORRECT’ is set, GNU ‘wc’
> treats the following Unicode characters as white space even if the
> current locale does not: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE,
> U+202F NARROW NO-BREAK SPACE, and U+2060 WORD JOINER."
>
> For GNU utilities, where info pages are preferred, such as
> coreutils*, compiler and language processors, and tools packages, many
> details do not appear in the man pages, for example:
>
> "Full documentation <https://www.gnu.org/software/coreutils/wc> or
> available locally via: info '(coreutils) wc invocation'"
>
> although `info wc` shows the same page.
>
> —————
> * [ arch b2sum base32 base64 basename cat chcon chgrp chmod chown
> chroot cksum comm cp csplit cut date dd df dir dircolors dirname du echo
> env expand expr factor false fmt fold gkill groups head hostid id
> install join link ln logname ls md5sum mkdir mkfifo mknod mktemp mv nice
> nl nohup nproc numfmt od paste pathchk pinky pr printenv printf ptx pwd
> readlink realpath rm rmdir runcon seq sha1sum sha224sum sha256sum
> sha384sum sha512sum shred shuf sleep sort split stat stdbuf stty sum
> sync tac tail tee test timeout touch tr true truncate tsort tty uname
> unexpand uniq unlink users vdir wc who whoami yes
>
Bash is a GNU project but, as far as I know, it isn't part of
coreutils. The shell expands '[0-9].txt' before ls ever runs, so what
matters here is bash's globbing behaviour: type 'man bash' and read the
'Pattern Matching' section.

TL;DR: for bash 5.2, 'export LC_ALL=C.UTF-8' (along the lines of
Brian's LC_ALL=C suggestion), 'export LC_COLLATE=C.UTF-8' or
'shopt -s globasciiranges' should revert to simple ASCII ranges for
'[0-9]', '[a-z]' etc.

By the way, I'm seeing the correct behaviour with up-to-date Cygwin
bash, coreutils etc. 'echo [0-9]*' only expands to the subscript and
superscript digits if I use 'LC_COLLATE=en_GB.UTF-8' or similar
together with 'shopt -u globasciiranges'.
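
If you want to check this on your own system, here is a minimal sketch
that compares the three settings side by side. The scratch directory,
the three file names and the variable names are only illustrative, and
it assumes a bash that knows the globasciiranges option and that an
en_US.UTF-8 locale may or may not be installed.

```shell
#!/usr/bin/env bash
# Sketch: see what [0-9].txt expands to under different settings.
# The scratch directory and file names are illustrative assumptions.

demo_dir=$(mktemp -d)
touch "$demo_dir/1.txt" "$demo_dir/₁.txt" "$demo_dir/¹.txt"

# Each case runs in a fresh bash so the shopt change stays local.

# 1. C locale: the range is compared by plain character code.
c_matches=$(LC_ALL=C bash -c 'cd "$1" && echo [0-9].txt' _ "$demo_dir")

# 2. Current locale, but with ASCII-style ranges forced on.
ascii_matches=$(bash -c 'cd "$1" && shopt -s globasciiranges && echo [0-9].txt' _ "$demo_dir")

# 3. Locale-aware ranges: en_US.UTF-8 collation, globasciiranges off.
#    The result depends on the collation tables of the system; stderr
#    is dropped in case the locale is not installed.
locale_matches=$(LC_ALL=en_US.UTF-8 bash -c 'cd "$1" && shopt -u globasciiranges && echo [0-9].txt' _ "$demo_dir" 2>/dev/null)

printf 'LC_ALL=C:                  %s\n' "$c_matches"
printf 'shopt -s globasciiranges:  %s\n' "$ascii_matches"
printf 'shopt -u, en_US.UTF-8:     %s\n' "$locale_matches"

rm -rf "$demo_dir"
```

I'd expect the first two lines to list only 1.txt. If the third line
also lists the subscript or superscript file, that is the locale-aware
range matching described above and not a problem in ls.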
--
Sam Edge