Re: documentation bug re character range expressions

Marcel (Felix) Giannelia Fri, 03 Jun 2011 00:09:24 -0700

Is it really a programmer mistake, though, to assume that [A-Z] is only capital letters? A through Z are a contiguous range in every representation system except EBCDIC, and the range is contiguous even in modern Unicode.

In the world of programming, characters are numbers, and programmers know this (especially if they've ever learned any C). For the example of [a-c], programmers are treating letters the way the machine treats them, as numbers.

How is the person typing [a-c] the one making the mistake when it results in matching against values outside of that range? To make it plainer, type it as [\0x61-\0x63] -- if you saw that in a program, you would expect that to cover 0x61, 0x62, 0x63, wouldn't you? If you were designing a programming language, wouldn't you make it do that?
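
As a minimal sketch of that expectation, the snippet below treats a range purely as numeric character codes, so a through c covers exactly 0x61 to 0x63. The helper name in_code_range is hypothetical, and the sketch assumes single-byte ASCII input.

```shell
# Hypothetical helper (not part of bash): succeed if the character $1 lies
# between $2 and $3 when all three are compared by numeric character code.
in_code_range() {
    # A leading single quote makes printf emit the character's numeric value.
    code=$(printf '%d' "'$1")
    low=$(printf '%d' "'$2")
    high=$(printf '%d' "'$3")
    [ "$code" -ge "$low" ] && [ "$code" -le "$high" ]
}

# Collect every candidate whose code falls inside the a..c range.
matched=""
for ch in a A b B c C d; do
    if in_code_range "$ch" a c; then
        matched="$matched$ch"
    fi
done
echo "$matched"   # prints: abc
```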

If person A types [\0x61-\0x63] on software written by person B and it comes out matching 0x61, 0x41, 0x62, 0x42, 0x63, and perhaps something completely different when the same code is run on a computer in Russia, who would you say made the programming mistake? Surely not person A.
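
To illustrate how those extra matches can arise, here is a sketch that resolves a range against a collation order instead of character codes. The order string is an assumption chosen for illustration (lowercase and uppercase interleaved); real locales define their own orders, and collation_range is a hypothetical helper.

```shell
# Assumed collation order, for illustration only: cases interleaved.
order="aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"

# Hypothetical helper: print every character that sorts from $1 through $2
# in $order. Assumes $1 appears at or before $2. The body runs in a
# subshell so its variables do not leak into the caller.
collation_range() (
    before_low=${order%%"$1"*}      # everything that sorts before $1
    after_high=${order#*"$2"}       # everything that sorts after $2
    rest=${order#"$before_low"}     # drop the part before $1
    result=${rest%"$after_high"}    # drop the part after $2
    printf '%s\n' "$result"
)

collation_range a c   # prints: aAbBc
collation_range a M   # prints: aAbBcCdDeEfFgGhHiIjJkKlLmM
```

Under this assumed order, a range written as a through c picks up A and B but not C, which is the set of values listed above.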

This is something that wasn't a "bad programming habit" until somewhere, someone made a decision that removed meaning from a sensible, logical-looking syntax.


Let's compare the syntaxes:

Under the old notation, there was:

- a succinct way to specify lowercase letters: [a-z]

- likewise for uppercase: [A-Z]

- likewise for case-insensitive: [A-Za-z]

- an easy way to specify ranges of letters of a particular case: [a-m], [A-M]

- case-insensitive ranges: [A-Ma-m]

Under the new notation, those things are written as:

- lowercase letters: [[:lower:]] (over twice as long to type)

- uppercase letters: [[:upper:]] (likewise)

- case-insensitive: [[:alpha:]] (not as bad, but still longer)

- how *are* you supposed to specify case-sensitive ranges? [abcdefghijklm] looks ridiculous.

- case-insensitive ranges: [a-M] (looks like an error at first glance: "why is the M uppercase?" You need to know something about the system internals to see why that's not wrong. And that something is a lot more complicated to explain than "computers represent letters as numbers")

Bash is a shell. Shells should have a quick, brief, plain language so that one can get things done in them. Shells should also be quite portable: syntax that works on one system should work on any other as much as possible.

[[:alpha:]] is too difficult to type to make it useful for the kind of quick pattern-matching that character ranges are used for on the interactive shell. Try it. Open-bracket, colon is an awkward sequence compared to something like "[a-z]".

But usually one doesn't want all of the alphabet, nor case insensitivity. I have actually never had occasion to say [A-Za-z] on the command line, or even [A-Ca-c]. I have, however, very often wanted to grab everything with a lowercase 'a' through lowercase 'k', for instance.

Previously, that would have been [a-k]. Now I have no way to specify it except [abcdefghijk], and I'm not typing that. A useful feature is gone.
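
One possible workaround, offered as a sketch rather than an answer to the complaint above: forcing the C locale for the duration of the match should make [a-k] compare by byte value again. This assumes the shell re-reads LC_ALL when it is assigned, as bash does; lower_a_to_k is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if $1 starts with a lowercase a through k.
# The body runs in a subshell, so the locale change does not leak out.
lower_a_to_k() (
    LC_ALL=C   # assumption: ranges follow byte order in the C locale
    case $1 in
        [a-k]*) exit 0 ;;
        *)      exit 1 ;;
    esac
)

# Try a few sample names against the helper.
for name in apple Apple kiwi mango; do
    if lower_a_to_k "$name"; then
        echo "$name: match"
    else
        echo "$name: no match"
    fi
done
```

This is still far more typing than a bare [a-k] at an interactive prompt, so it does not restore the convenience described above.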

You say this is not only a "bash problem" because it's a programmer's mistake to assume that [a-c] means the same thing in bash as it does in Perl, Python, Java, C/C++ (POSIX regex.h, with system locale set!), JavaScript, PHP, sed, grep, and on and on -- you can see why one might make this "mistake".

And these aren't historical examples, these are modern implementations of these languages that I just tested this on to double-check, on a system with its locale set to something that collates case-insensitively. Bash is the *only* thing I know of that treats character ranges this way, so I would say that does make it "only a bash problem".

Even grep, whose man page says it obeys LC_COLLATE and the locale, actually has [a-c] equivalent to [abc] on all locales. Someone must have snuck in and fixed it. I'm guessing that if grep were to start using locale-aware character ranges, a heck of a lot more people would complain than do about bash. This is a seldom-used feature in bash but many, many people rely on grep being predictable and standard.


~Felix.


On 2011-06-02 22:32, Jan Schampera wrote:

Hi,


Just as a side note, not meant to touch the maintainer discussion.

This is not only a "Bash problem". The programmer/user mistake to use
[A-Z] for "only capital letters, capital A to capital Z" is a very
common one.

But I'm not sure if every official application-level documentation
should cover those kinds of pitfalls. There would be many topics around
"bad programming habits" that should be documented.
