Re: documentation bug re character range expressions

Marcel (Felix) Giannelia Fri, 03 Jun 2011 13:27:01 -0700

On Fri, June 3, 2011 10:03, Greg Wooledge wrote:
> On Fri, Jun 03, 2011 at 09:12:07AM -0700, Marcel (Felix) Giannelia wrote:
>
> [...]
>
> In HP-UX's en_US.iso88591 locale, the characters are in a COMPLETELY
> different order.  You can't easily figure out what that order is, because
> it's not documented anywhere, but by using tricks you can beat it into
> submission.  Instead of having two separate ranges from a to z, and from A
> to Z, there's just one big range from A to z (actually þ) which looks
> something like:
>
> A a Á á À à Â â Ä ä Å å Ã ã Æ æ B b C c Ç ç D d Ð ð E e É .... Z z Þ þ
>
>
> So when you write A-Z you mean A a Á ... Z.  And a-z means a Á ... Z z.
> In other words, when you tell tr to map from A-Z to a-z all you're
> actually doing is shifting the map one position to the right.  So H becomes
> h, e becomes É, l becomes M, and so on.  Whereas in ASCII, mapping from
> A-Z to a-z shifts everything 32 to the right (the
> difference between 'A' and 'a'), so H becomes h and so on.
>
> The GNU people apparently didn't like this,


See, this is where people like me get confused -- it sounds from your
description above ("...can't easily figure out what that order is...",
"...beat it into submission", and so forth) as if you don't much care for
it either.
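
For anyone following along, the "shift one position to the right" effect in
the quoted description can be reproduced without an HP-UX box. This is only a
sketch under a loud assumption: the variable `order` below is a hypothetical,
simplified collation order containing just the ASCII letters interleaved, not
the real HP-UX order (which also contains the accented letters). Both ranges
are spelled out by hand, and the C locale is forced, so the result does not
depend on the machine it runs on.

```shell
# Hypothetical, simplified collation order: ASCII letters only, interleaved.
# (Assumption for illustration; the real order also has accented letters.)
order='AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

# In that order, "A-Z" runs from A up to Z (everything except the final z),
# and "a-z" runs from a up to z (everything except the initial A).
range_AZ=${order%z}
range_az=${order#A}

# Map the first list onto the second, position by position, as tr does.
shifted=$(printf 'Hello' | LC_ALL=C tr "$range_AZ" "$range_az")
printf '%s\n' "$shifted"   # prints "hFMMP"
```

With the accented letters left out, e lands on F rather than É, but the
mechanism is the same: every character moves to its right-hand neighbour in
the collation order instead of to its lowercase partner.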

It sounds to me like what you're saying is, the *only* uses of bracket
range expressions guaranteed to be "portable" are things like [[:upper:]]
and [[:lower:]]. But I put "portable" in quotation marks just then,
because to my mind the word "portable" implies "has the same behaviour on
all systems", whereas things like [[:upper:]] are locale-dependent; they
change their behaviour depending on system settings.

It would be a bit like "echo $((1 + 1))" printing "2", "4", or "5",
depending on the locale setting (dear lordy I hope it doesn't).

[0-9] presumably still works consistently across all platforms -- I hope?
Or does it map to something like:

0 <the Mayan seashell symbol> <the traditional Chinese symbol for 0> 1
<the Cherokee script symbol for "1"> <the Mayan "dot" symbol> <the letters
i and I (because of Roman numerals)> 2 <the Mayan "two dots" symbol> <the
symbol from Papua/New Guinea that means "not many, but at least more than
one bean"> <two consecutive occurrences of the letter I> ...

etc.? I'm just having trouble imagining in what situations this kind of
behaviour is useful. But, from your description, my understanding now is
that it's no longer meant to be useful, i.e. no longer meant to be used
that way.

I think a good solution to this, then, is to just deprecate the use of "-"
in bracket expressions entirely. As you say, it's non-portable and
unpredictable and highly locale- and even OS-dependent (and whether GNU is
in the wrong or HP is in the wrong is neither here nor there).

Use the magic of locales to mark "-" as "just another character", and then
[A-Z] means only "A", "-", or "Z" -- at least that would be easier for the
common woman and man to understand than "[A-Z] means 'that long string of
things from the HP ISO 8859-1 code page if you're on an HP-UX system,
uppercase and lowercase letters except lowercase 'a' if you're on Linux,
uppercase A-Z if you're in the C locale, and so on'."

This way, people would be forced to either use the C locale, or use
[[:upper:]] if they want to match uppercase letters, which is what you've
been saying should happen.
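
A minimal sketch of those two options, assuming a POSIX shell and tr. The
variable names are made up for illustration. The first command pins the C
locale for that one command, so A-Z and a-z are the plain ASCII ranges; the
second asks for uppercase and lowercase by name instead of by range. LC_ALL=C
is set on both here only so that the output is the same everywhere; without
it, the second form follows whatever the current locale calls "uppercase".

```shell
# Option 1: force the C locale, where A-Z is exactly the 26 ASCII capitals.
lowered_range=$(printf 'Hello World' | LC_ALL=C tr 'A-Z' 'a-z')

# Option 2: name the character classes instead of writing a range.
lowered_class=$(printf 'Hello World' | LC_ALL=C tr '[:upper:]' '[:lower:]')

printf '%s\n' "$lowered_range" "$lowered_class"
```

Both lines print "hello world".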

> [...]

~Felix.
