Re: locale specific ordering in EN_US -- why is a

2013-07-01 Thread Aharon Robbins
[ I know I'm going to regret this... ]

> `[a-z]' is case insensitive
>
>   You are encountering problems with locales.  POSIX mandates that `[a-z]'
>   uses the current locale's collation order -- in C parlance, that means
>   strcoll(3) instead of strcmp(3).

As of the 2008 standard, this is no longer true. Ranges are now
implementation defined. This is what gives us the leeway to move to
range interpretation not based on locales.

Although in theory locales seem like a good idea, and having '[a-z]'
include all kinds of other characters between the ASCII 'a' and 'z'
sounds nice, well over 10 years of experience has shown me, at least,
that it only confuses users and leads to problems.

For example, in some vendor en_US.UTF-8 locales, the ordering is

AaBb ... YyZz

and in others it is:

aAbB ... yYzZ

So try to explain why '[a-z]' includes all of a...z but only B...Z
(with the first ordering) or only A...Y (with the second) !!!

In short, nothing but pain and confusion and endless bug reports.

By defining '[a-z]' as using the machine's character set, you
know what you're getting, and you are compatible with original
Unix practice. (You are in for slight confusion on an EBCDIC
machine, but that was always the case anyway, and that is several
orders of magnitude less of a problem than the mess created by locales.)
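
As a rough illustration of ranges based on the character set (a minimal
sketch, not gawk's own code): under LC_ALL=C, grep interprets '[a-z]' by
byte value, so only the 26 lowercase ASCII letters match. The helper name
in_range is made up for this example.

```shell
# Report whether a single character falls inside [a-z] when the range
# is interpreted in the C locale, i.e. by byte value.
in_range() {
    if printf '%s\n' "$1" | LC_ALL=C grep -q '^[a-z]$'; then
        echo yes
    else
        echo no
    fi
}

# Lowercase endpoints match; uppercase letters do not.
for ch in a z A Z; do
    printf '%s: %s\n' "$ch" "$(in_range "$ch")"
done
```

Running the same check without LC_ALL=C may give different answers for
the uppercase letters, depending on the tool version and the locale tables.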

After moving gawk to historic range interpretation, the number
of bug reports related to this has dropped to close to zero.
I'm happier, and my users are happier.

I'd be thrilled if the GLIBC locale tables would be fixed. But
in the meantime, I have decided to leave this whole issue behind me.

I'll go crawl back under my rock now.

Arnold



Bash case-modifying word expansions and locales

2013-07-01 Thread Tomasz Tomasik
Hello.

I have a problem with case-modifying word expansions in bash.
http://wiki.bash-hackers.org/syntax/pe#case_modification

bash -c 'foo="żółw"; echo ${foo^^}'
żółW

Characters with diacritical marks are not affected.

However, it works in zsh:
zsh -c 'foo="żółw"; echo ${(U)foo}'
ŻÓŁW
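
One possible workaround, offered as a sketch and not verified on the
system described here: hand the conversion to awk's toupper(). Whether
letters outside ASCII are converted depends on the awk implementation and
the active locale; gawk is multibyte-aware in UTF-8 locales, while other
awks may change only the ASCII letters. The function name to_upper is
hypothetical.

```shell
# Uppercase a string by delegating to awk instead of ${var^^}.
# Assumption: the awk found in PATH handles multibyte characters in
# the current locale; if it does not, only ASCII letters are changed.
to_upper() {
    printf '%s\n' "$1" | awk '{ print toupper($0) }'
}

foo="żółw"
to_upper "$foo"
```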

Terminal character encoding: UTF-8

# locale
LANG=en_US.utf8
LC_CTYPE="pl_PL.utf8"
LC_NUMERIC="pl_PL.utf8"
LC_TIME="pl_PL.utf8"
LC_COLLATE="pl_PL.utf8"
LC_MONETARY="pl_PL.utf8"
LC_MESSAGES="pl_PL.utf8"
LC_PAPER="pl_PL.utf8"
LC_NAME="pl_PL.utf8"
LC_ADDRESS="pl_PL.utf8"
LC_TELEPHONE="pl_PL.utf8"
LC_MEASUREMENT="pl_PL.utf8"
LC_IDENTIFICATION="pl_PL.utf8"
LC_ALL=pl_PL.utf8

I also tried LC_ALL=en_US.utf8

# locale -a | grep -E "^pl_"
pl_PL
pl_PL.iso88592
pl_PL.utf8

# bash --version
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

# echo "$BASH_VERSION"
4.1.2(1)-release

Package: bash-4.1.2-14.el6.x86_64.rpm from sl

I also tried bash-4.2.37-1.el6.x86_64.rpm compiled
from bash-4.2.37-1.fc16.src.rpm.
http://koji.fedoraproject.org/koji/buildinfo?buildID=343762

# uname -r
2.6.32-358.11.1.el6.x86_64

# grep "" /etc/*-release
/etc/lsb-release:LSB_VERSION=base-4.0-amd64:base-4.0-ia32:base-4.0-noarch:core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
/etc/redhat-release:Scientific Linux release 6.4 (Carbon)
/etc/system-release:Scientific Linux release 6.4 (Carbon)

I really appreciate any help you can provide.

-- 
Tomasz Tomasik