Re: awk gsub problem

Eric Blake Mon, 20 Sep 2010 08:16:25 -0700

On 09/19/2010 02:33 PM, Lee wrote:

> If LANG is "en_US" or "en_US.utf8", then the regular expression "[a-z]"
> does *not* correspond anymore to the ASCII codes.  Rather it corresponds
> to something like "[aAbBcCdD...zZ]", independent of the actual character
> encoding ISO-8859-1 or UTF-8.

In glibc, [a-z] gets translated according to locale collation order.  If
A collates before a, then it maps to [aBbCc...Zz]; if A collates after a,
then it maps to [aAbB...yYz] (notice that in either case, one capital
letter, A or Z, is omitted, so it is NOT the same as all 26 letters in
both upper and lower case).
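
As a rough illustration of that mapping, here is a small sketch that
expands a range against a hand-written collation sequence.  It does not
query any real locale: the two sequences and the function name
expand_range are made up for the example, and real locale data covers far
more than these 52 characters.

```shell
# Hypothetical collation sequences (illustration only, not real locale data).
upper_first='AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'
lower_first='aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ'

# expand_range SEQ LO HI: print every character of SEQ from LO through HI.
expand_range() {
    LC_ALL=C awk -v seq="$1" -v lo="$2" -v hi="$3" 'BEGIN {
        first = index(seq, lo)
        last = index(seq, hi)
        print substr(seq, first, last - first + 1)
    }'
}

expand_range "$upper_first" a z   # starts at "a", so "A" is left out
expand_range "$lower_first" a z   # stops at "z", so "Z" is left out
```

Each call prints 51 characters rather than 52, which is the off-by-one
described above.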

This has been a MUCH complained-about feature of glibc, which has in
turn been copied by bash, awk, grep, etc.

Note that POSIX explicitly states that [a-z] has unspecified results in
any locale except C.  So the glibc behavior is permitted, but so is the
traditional behavior of just the 26 lowercase letters.
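
Since only the C locale is pinned down, one way to sidestep the question
is to run the command under LC_ALL=C.  A minimal sketch, assuming ASCII
input; the function name mask_lower is made up for the example:

```shell
# Hypothetical helper: replace each ASCII lowercase letter with "x".
# LC_ALL=C applies only to this awk invocation and forces the C locale,
# where [a-z] means exactly the 26 lowercase letters.
mask_lower() {
    LC_ALL=C awk '{ gsub(/[a-z]/, "x"); print }'
}

printf '%s\n' 'Hello World' | mask_lower
```

Under the C locale this prints "Hxxxx Wxxxx": the capitals are left
alone.  The trade-off is that non-ASCII text is then treated as raw
bytes, which may or may not be acceptable for your data.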

If you can convince the glibc folks that [a-z] should have the
traditional behavior, more power to you.


http://lists.gnu.org/archive/html/bug-grep/2010-09/msg00030.html

--
Eric Blake   ebl...@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
