Is it really a programmer mistake, though, to assume that [A-Z] is only
capital letters? A through Z form a contiguous range in every
character encoding except EBCDIC, and they are contiguous even in
modern Unicode.
In the world of programming, characters are numbers, and programmers know
this (especially if they've ever learned any C). For the example of
[a-c], programmers are treating letters the way the machine treats them,
as numbers.
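
A minimal sketch of the "letters are numbers" point, assuming an
ASCII-based system and a POSIX-style printf. The helper name char_codes
is made up for illustration:

```shell
# Print the numeric code of each argument's first character,
# space-separated. A leading single quote in a printf argument
# asks for the numeric value of the character that follows it.
char_codes() {
    out=""
    for ch in "$@"; do
        code=$(printf '%d' "'$ch")
        out="${out:+$out }$code"
    done
    printf '%s\n' "$out"
}

char_codes a b c    # 97 98 99 on an ASCII-based system
char_codes A B C    # 65 66 67 on an ASCII-based system
```

Each case occupies its own contiguous run of values, which is what
someone typing [a-c] has in mind.
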
How is the person typing [a-c] the one making the mistake when it
results in matching against values outside of that range? To make it
plainer, type it as [\0x61-\0x63] -- if you saw that in a program, you
would expect that to cover 0x61, 0x62, 0x63, wouldn't you? If you were
designing a programming language, wouldn't you make it do that?
If person A types [\0x61-\0x63] on software written by person B and it
comes out matching 0x61, 0x41, 0x62, 0x42, 0x63, and perhaps something
completely different when the same code is run on a computer in Russia,
who would you say made the programming mistake? Surely not person A.
This is something that wasn't a "bad programming habit" until somewhere,
someone made a decision that removed meaning from a sensible,
logical-looking syntax.
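
To make the expected meaning concrete, here is a sketch that forces
collation to the C locale for a single match, where a range follows
plain character-code order. The helper name matches_in_c_locale is
hypothetical, and forcing LC_ALL=C is an assumption that may not suit
non-ASCII text:

```shell
# Hypothetical helper: succeed if string $1 matches glob pattern $2
# with the locale forced to C. The body runs in a subshell, so the
# caller's locale settings are left untouched.
matches_in_c_locale() (
    LC_ALL=C
    export LC_ALL
    case $1 in
        $2) exit 0 ;;   # unquoted, so $2 is treated as a pattern
        *)  exit 1 ;;
    esac
)

for s in a A b B c C d; do
    if matches_in_c_locale "$s" '[a-c]'; then
        printf '%s matches [a-c]\n' "$s"
    fi
done
# In the C locale only a, b and c are reported.
```
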
Let's compare the syntaxes:
Under the old notation, there was:
- a succinct way to specify lowercase letters: [a-z]
- likewise for uppercase: [A-Z]
- likewise for case-insensitive: [A-Za-z]
- an easy way to specify ranges of letters of a particular case: [a-m],
[A-M]
- case-insensitive ranges: [A-Ma-m]
Under the new notation, those things are written as:
- lowercase letters: [[:lower:]] (over twice as long to type)
- uppercase letters: [[:upper:]] (likewise)
- case-insensitive: [[:alpha:]] (not as bad, but still longer)
- how *are* you supposed to specify case-sensitive ranges?
[abcdefghijklm] looks ridiculous.
- case-insensitive ranges: [a-M] (looks like an error at first glance:
"why is the M uppercase?" You need to know something about the system
internals to see why that's not wrong, and that something is a lot more
complicated to explain than "computers represent letters as numbers")
Bash is a shell. Shells should have a quick, brief, plain language so
that one can get things done in them. Shells should also be quite
portable: syntax that works on one system should work on any other as
much as possible.
[[:alpha:]] is too difficult to type to make it useful for the kind of
quick pattern-matching that character ranges are used for on the
interactive shell. Try it. Open-bracket, colon is an awkward sequence
compared to something like "[a-z]".
But usually one doesn't want all of the alphabet, nor case
insensitivity. I have actually never had occasion to say [A-Za-z] on the
command line, or even [A-Ca-c]. I have, however, very often wanted to
grab everything with a lowercase 'a' through lowercase 'k', for instance.
Previously, that would have been [a-k]. Now I have no way to specify it
except [abcdefghijk], and I'm not typing that. A useful feature is gone.
You say this is not only a "bash problem" because it's a programmer's
mistake to assume that [a-c] means the same thing in bash as it does in
Perl, Python, Java, C/C++ (POSIX regex.h, with system locale set!),
JavaScript, PHP, sed, grep, and on and on -- you can see why one might
make this "mistake".
And these aren't historical examples. These are modern implementations
of these languages, which I just tested to double-check, on a
system with its locale set to something that collates
case-insensitively. Bash is the *only* thing I know of that treats
character ranges this way, so I would say that does make it "only a bash
problem".
Even grep, whose man page says it obeys LC_COLLATE and the locale,
actually has [a-c] equivalent to [abc] on all locales. Someone must have
snuck in and fixed it. I'm guessing that if grep were to start using
locale-aware character ranges, a heck of a lot more people would
complain than do about bash. This is a seldom-used feature in bash but
many, many people rely on grep being predictable and standard.
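
A rough way to check grep on a given system is sketched below. The
helper name grep_range_hits is hypothetical, it assumes grep and paste
are on the PATH, and only the C-locale result is stated because other
locales depend on the installed grep and locale data:

```shell
# Hypothetical helper: list which of a few test letters grep's [a-c]
# matches under the locale named in $1.
grep_range_hits() {
    printf '%s\n' a A b B c C d |
        LC_ALL=$1 grep '^[a-c]$' |
        paste -s -d ' ' -
}

grep_range_hits C    # a b c
# Try a locale installed on your system, for example:
#   grep_range_hits en_US.UTF-8
# and compare the result with the C-locale line above.
```
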
~Felix.
On 2011-06-02 22:32, Jan Schampera wrote:
Hi,
just as side note, not meant to touch the maintainer discussion.
This is not only a "Bash problem". The programmer/user mistake to use
[A-Z] for "only capital letters, capital A to capital Z" is a very
common one.
But I'm not sure if every official application-level documentation
should cover those kinds of pitfalls. There would be many topics around
"bad programming habits" that should be documented.