Re: square bracket vs. curly brace character ranges

DJ Mills Fri, 14 Sep 2012 14:01:48 -0700

On Fri, Sep 14, 2012 at 1:49 AM, Marcel Giannelia <i...@skeena.net> wrote:
> I believe I've found an inconsistency in bash or its documentation.
>
> I know that things like [a-c] are highly locale-dependent in
> bash (doesn't mean I have to like it, but there it is). Fine. I've
> learned to live with it.
>
> But the other day I was on a fresh install (hadn't set
> LC_COLLATE=C yet, so I was in en_US.UTF-8), and this happened:
>
> $ touch {a..c}
> $ ls
> a  b  c
> $ touch {A..C}
> $ ls
> a  A  b  B  c  C
> $ ls {a..c}
> a  b  c
> $ ls [a-c]
> a  A  b  B  c
>
> Curly brace range expressions behave differently from square-bracket
> ranges. Is this intentional? This is under Arch Linux, bash version
> "4.2.37(2)-release (i686-pc-linux-gnu)".
>
> The man page seems to imply that the curly brace behaviour above is a
> bug:
>
> "When characters are supplied, the expression expands to each character
> lexicographically between x and y, inclusive."
>
> ...although this documentation suffers from the same problem as the
> passage about character class ranges, namely that it confuses
> lexicographic sort order (character collation *weights*) with
> character collation *sequence values* (they are not quite the same thing
> -- if they were, 'c' and 'C' would *always always always* appear
> together in a range expansion, because:
> $ touch aa B cd C
> $ ls -1
> aa
> B
> C
> cd
> ). The phrases "sorts between" and "lexicographically between" refer to
> collation *weights*, but bash clearly uses sequence values.
>
> It's a subtle distinction; I beat it to death in a thread
> from 2011, subject "documentation bug re character range expressions",
> but I don't think the documentation actually got changed.
>
> It seems the thinking goes something like, "since no one is supposed to
> use expressions like [a-c], we don't have to precisely document, care
> about, or even *know* what it means" -- a shame, because with
> LC_COLLATE=C set, [a-c] is actually quite useful, and in all other
> locales it isn't useful at all (it would be slightly useful if it used
> weights like the documentation says because then it would be like a
> case-insensitive range, but with it using sequence values instead, it's
> useless).
>
> The sheer number of threads we've got complaining about
> locale-dependent [a-c] suggests to me that the software should be
> changed to just do what people expect, especially since nothing is
> really lost by doing so.
>
> Oh well. Dead horses and all that -- but can we at least make the dead
> horses consistent? :)
>
> ~Felix.
>



http://mywiki.wooledge.org/locale
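
The difference described above can be reproduced in isolation. What follows is only a minimal sketch, assuming bash and mktemp are installed; the names demo_dir, brace_result and glob_result are made up for illustration and do not come from the original report. It forces the C locale for the glob, which is the setting under which the quoted message says [a-c] is "actually quite useful", and compares the result with the brace expansion.

```shell
# Illustrative sketch only: demo_dir, brace_result and glob_result are
# hypothetical names, not part of the original report.
demo_dir=$(mktemp -d)
touch "$demo_dir"/a "$demo_dir"/b "$demo_dir"/c \
      "$demo_dir"/A "$demo_dir"/B "$demo_dir"/C

# Brace expansion generates words without looking at the filesystem.
brace_result=$(bash -c 'echo {a..c}')

# The glob is matched against existing filenames; with LC_ALL=C the
# range [a-c] covers only the lowercase letters a, b and c.
glob_result=$(cd "$demo_dir" && LC_ALL=C bash -c 'echo [a-c]')

echo "brace: $brace_result"
echo "glob:  $glob_result"

rm -rf "$demo_dir"
```

Running the glob line without LC_ALL=C in a locale such as en_US.UTF-8 is what produced the mixed-case listing in the quoted session; whether it still does depends on the bash version and its settings.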
