Fwd: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8

Paolo Bonzini Fri, 01 Jun 2012 12:53:05 -0700

Here is a report from a GNU sed user.

Paolo


> SETUP:
> $ sw_vers
> ProductName:  Mac OS X
> ProductVersion:       10.7.4
> BuildVersion: 11E53
> 
> $ ~/gnu/bin/sed --version
> GNU sed version 4.2.1
> 
> PROBLEM: With UTF-8 input, but LANG and LC_ALL set to C, sed regular
> expressions break on multibyte sequences. For example (constructed
> from part of a git command):
> 
> $ echo "Rémi Leblond" | LANG=C LC_ALL=C ~/gnu/bin/sed -ne \
>     's/.*/GIT_AUTHOR_NAME='\''&'\''/p'
> 
> EXPECTED: GIT_AUTHOR_NAME='Rémi Leblond'
> ACTUAL: GIT_AUTHOR_NAME='R'émi Leblond
> 
> DISCUSSION: The problem starts in sed/lib/localcharset.c, in
> locale_charset(), at line 334:
> 
> # if HAVE_LANGINFO_CODESET
> 
>   /* Most systems support nl_langinfo (CODESET) nowadays.  */
>   codeset = nl_langinfo (CODESET);
> 
> Since we set LC_ALL to C, we trigger this code in Libc:
> 
> http://www.opensource.apple.com/source/Libc/Libc-763.13/locale/nl_langinfo-fbsd.c,
> line 54:
> 
>       case CODESET:
>               ret = "";
>               if ((s = querylocale(LC_CTYPE_MASK, loc)) != NULL) {
>                       if ((cs = strchr(s, '.')) != NULL)
>                               ret = cs + 1;
>                       else if (strcmp(s, "C") == 0 ||
>                                strcmp(s, "POSIX") == 0)
>                               ret = "US-ASCII";
>                       else if (strcmp(s, "UTF-8") == 0)
>                               ret = "UTF-8";
>               }
>               break;
> 
> As you can see, querylocale() will return "C", and
> nl_langinfo(CODESET) will return "US-ASCII". The other thing to
> realize is that on OS X MB_CUR_MAX is a macro for ___mb_cur_max(),
> which returns 1 when LC_ALL is C.
> 
> Back to sed/lib/localcharset.c, we end up at locale_charset(), line 483:
> 
>   /* Resolve alias. */
>   for (aliases = get_charset_aliases ();
>        *aliases != '\0';
>        aliases += strlen (aliases) + 1, aliases += strlen (aliases) + 1)
>     if (strcmp (codeset, aliases) == 0
>       || (aliases[0] == '*' && aliases[1] == '\0'))
>       {
>       codeset = aliases + strlen (aliases) + 1;
>       break;
>       }
> 
> This tries to alias our charset, "US-ASCII", to something sed
> understands. get_charset_aliases() is at line 112 in the same file. On
> OS X 10.7, DARWIN7 is defined (always for OS X 10.3 or newer), so we
> end up at line 223:
> 
>       /* To avoid the trouble of installing a file that is shared by many
>        GNU packages -- many packaging systems have problems with this --,
>        simply inline the aliases here.  */
>       cp = "ISO8859-1" "\0" "ISO-8859-1" "\0"
>          "ISO8859-2" "\0" "ISO-8859-2" "\0"
>          "ISO8859-4" "\0" "ISO-8859-4" "\0"
>          "ISO8859-5" "\0" "ISO-8859-5" "\0"
>          "ISO8859-7" "\0" "ISO-8859-7" "\0"
>          "ISO8859-9" "\0" "ISO-8859-9" "\0"
>          "ISO8859-13" "\0" "ISO-8859-13" "\0"
>          "ISO8859-15" "\0" "ISO-8859-15" "\0"
>          "KOI8-R" "\0" "KOI8-R" "\0"
>          "KOI8-U" "\0" "KOI8-U" "\0"
>          "CP866" "\0" "CP866" "\0"
>          "CP949" "\0" "CP949" "\0"
>          "CP1131" "\0" "CP1131" "\0"
>          "CP1251" "\0" "CP1251" "\0"
>          "eucCN" "\0" "GB2312" "\0"
>          "GB2312" "\0" "GB2312" "\0"
>          "eucJP" "\0" "EUC-JP" "\0"
>          "eucKR" "\0" "EUC-KR" "\0"
>          "Big5" "\0" "BIG5" "\0"
>          "Big5HKSCS" "\0" "BIG5-HKSCS" "\0"
>          "GBK" "\0" "GBK" "\0"
>          "GB18030" "\0" "GB18030" "\0"
>          "SJIS" "\0" "SHIFT_JIS" "\0"
>          "ARMSCII-8" "\0" "ARMSCII-8" "\0"
>          "PT154" "\0" "PT154" "\0"
>        /*"ISCII-DEV" "\0" "?" "\0"*/
>          "*" "\0" "UTF-8" "\0";
> 
> And here is the root problem. This table has no entry for US-ASCII,
> so the lookup falls through to the default entry, "*", which maps
> every unlisted name to "UTF-8". That is what locale_charset()
> returns, and sed then sets a UTF-8 flag that many parts of the
> program consult.
> 
> But this is dangerous, because now UTF-8 is set but MB_CUR_MAX is 1
> and various parts of sed interpret "Rémi Leblond" as an invalid
> character sequence for a UTF-8 character set. This is why /.*/ in the
> regular expression only matches the "R" before bailing on the "é".
> 
> POSIX says that the "C" locale should treat text data as binary input,
> but in this situation sed is trying to treat it as a multibyte
> encoding.
> 
> FIX: the DARWIN7 table in get_charset_aliases() should not contain a
> default that maps everything not defined to "UTF-8". Or at the very
> least, it should include an entry for "US-ASCII" that maps to "ASCII",
> as a charset.aliases file might.
>
