On 06/01/2012 08:52 PM, Paolo Bonzini wrote:
> Here is a report from a GNU sed user.
>
> Paolo
>
>> SETUP:
>> $ sw_vers
>> ProductName: Mac OS X
>> ProductVersion: 10.7.4
>> BuildVersion: 11E53
>>
>> $ ~/gnu/bin/sed --version
>> GNU sed version 4.2.1
>>
>> PROBLEM: With UTF-8 input, but LANG and LC_ALL set to C, sed regular
>> expressions break on multibyte sequences. For example (constructed
>> from part of a git command):
>>
>> $ echo "Rémi Leblond" | LANG=C LC_ALL=C ~/gnu/bin/sed -ne 's/.*/GIT_AUTHOR_NAME='\''&'\''/p'
>>
>> EXPECTED: GIT_AUTHOR_NAME='Rémi Leblond'
>> ACTUAL: GIT_AUTHOR_NAME='R'émi Leblond
>>
>> DISCUSSION: The problem starts in sed/lib/localcharset.c,
>> locale_charset, line 334
>>
>> # if HAVE_LANGINFO_CODESET
>>
>>   /* Most systems support nl_langinfo (CODESET) nowadays. */
>>   codeset = nl_langinfo (CODESET);
>>
>> Since we set LC_ALL to C, we trigger this code in Libc:
>>
>> http://www.opensource.apple.com/source/Libc/Libc-763.13/locale/nl_langinfo-fbsd.c,
>> line 54:
>>
>>     case CODESET:
>>         ret = "";
>>         if ((s = querylocale(LC_CTYPE_MASK, loc)) != NULL) {
>>             if ((cs = strchr(s, '.')) != NULL)
>>                 ret = cs + 1;
>>             else if (strcmp(s, "C") == 0 ||
>>                 strcmp(s, "POSIX") == 0)
>>                 ret = "US-ASCII";
>>             else if (strcmp(s, "UTF-8") == 0)
>>                 ret = "UTF-8";
>>         }
>>         break;
>>
>> As you can see, querylocale() will return "C", and
>> nl_langinfo(CODESET) will return "US-ASCII". The other thing to
>> realize is that on OS X MB_CUR_MAX is a macro for ___mb_cur_max(),
>> which returns 1 when LC_ALL is C.
>>
>> Back to sed/lib/localcharset.c, we end up at locale_charset(), line 483:
>>
>>   /* Resolve alias. */
>>   for (aliases = get_charset_aliases ();
>>        *aliases != '\0';
>>        aliases += strlen (aliases) + 1, aliases += strlen (aliases) + 1)
>>     if (strcmp (codeset, aliases) == 0
>>         || (aliases[0] == '*' && aliases[1] == '\0'))
>>       {
>>         codeset = aliases + strlen (aliases) + 1;
>>         break;
>>       }
>>
>> This tries to alias our charset, "US-ASCII", to something sed
>> understands. get_charset_aliases() is at line 112 in the same file. On
>> OS X 10.7, DARWIN7 is defined (always for OS X 10.3 or newer), so we
>> end up at line 223:
>>
>>   /* To avoid the trouble of installing a file that is shared by many
>>      GNU packages -- many packaging systems have problems with this --,
>>      simply inline the aliases here. */
>>   cp = "ISO8859-1" "\0" "ISO-8859-1" "\0"
>>        "ISO8859-2" "\0" "ISO-8859-2" "\0"
>>        "ISO8859-4" "\0" "ISO-8859-4" "\0"
>>        "ISO8859-5" "\0" "ISO-8859-5" "\0"
>>        "ISO8859-7" "\0" "ISO-8859-7" "\0"
>>        "ISO8859-9" "\0" "ISO-8859-9" "\0"
>>        "ISO8859-13" "\0" "ISO-8859-13" "\0"
>>        "ISO8859-15" "\0" "ISO-8859-15" "\0"
>>        "KOI8-R" "\0" "KOI8-R" "\0"
>>        "KOI8-U" "\0" "KOI8-U" "\0"
>>        "CP866" "\0" "CP866" "\0"
>>        "CP949" "\0" "CP949" "\0"
>>        "CP1131" "\0" "CP1131" "\0"
>>        "CP1251" "\0" "CP1251" "\0"
>>        "eucCN" "\0" "GB2312" "\0"
>>        "GB2312" "\0" "GB2312" "\0"
>>        "eucJP" "\0" "EUC-JP" "\0"
>>        "eucKR" "\0" "EUC-KR" "\0"
>>        "Big5" "\0" "BIG5" "\0"
>>        "Big5HKSCS" "\0" "BIG5-HKSCS" "\0"
>>        "GBK" "\0" "GBK" "\0"
>>        "GB18030" "\0" "GB18030" "\0"
>>        "SJIS" "\0" "SHIFT_JIS" "\0"
>>        "ARMSCII-8" "\0" "ARMSCII-8" "\0"
>>        "PT154" "\0" "PT154" "\0"
>>        /*"ISCII-DEV" "\0" "?" "\0"*/
>>        "*" "\0" "UTF-8" "\0";
>>
>> And here is the root problem. This table does not have an entry for
>> US-ASCII. So it catches the default entry, "*", which maps everything
>> to "UTF-8", and that's what get_charset_aliases() returns, and what
>> locale_charset() returns, which then sets a UTF-8 flag in sed that
>> gets used by many parts.
>>
>> But this is dangerous, because now UTF-8 is set but MB_CUR_MAX is 1
>> and various parts of sed interpret "Rémi Leblond" as an invalid
>> character sequence for a UTF-8 character set. This is why /.*/ in the
>> regular expression only matches the "R" before bailing on the "é".
>>
>> POSIX says that the "C" locale should treat text data as binary input,
>> but in this situation sed is trying to treat it as a multibyte
>> encoding.
>>
>> FIX: the DARWIN7 table in get_charset_aliases() should not contain a
>> default that maps everything not defined to "UTF-8". Or at the very
>> least, it should include an entry for "US-ASCII" that maps to "ASCII",
>> as a charset.aliases file might.
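
For anyone following along without a Mac, the wildcard fallthrough described in the report can be reproduced in isolation. The following is only a sketch, not the gnulib code: resolve_alias(), table_with_wildcard and table_with_ascii are hypothetical names, the tables are shortened stand-ins for the inlined DARWIN7 table, and the loop mirrors the "Resolve alias" loop quoted above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, shortened stand-ins for the inlined DARWIN7 table.
   Each table is a run of (alias, canonical name) pairs of NUL-terminated
   strings; the literal's implicit final NUL acts as the end marker. */
const char table_with_wildcard[] =
  "ISO8859-1" "\0" "ISO-8859-1" "\0"
  "*" "\0" "UTF-8" "\0";

/* Same table with the explicit entry suggested in the report. */
const char table_with_ascii[] =
  "ISO8859-1" "\0" "ISO-8859-1" "\0"
  "US-ASCII" "\0" "ASCII" "\0"
  "*" "\0" "UTF-8" "\0";

/* Walk the pairs as the quoted loop does: return the canonical name of
   the first alias that equals CODESET or is the wildcard "*".  If nothing
   matches, return CODESET unchanged. */
const char *
resolve_alias (const char *aliases, const char *codeset)
{
  assert (aliases != NULL && codeset != NULL);
  for (; *aliases != '\0';
       aliases += strlen (aliases) + 1, aliases += strlen (aliases) + 1)
    if (strcmp (codeset, aliases) == 0
        || (aliases[0] == '*' && aliases[1] == '\0'))
      return aliases + strlen (aliases) + 1;
  return codeset;
}
```

With these tables, resolve_alias (table_with_wildcard, "US-ASCII") falls through to the "*" entry and yields "UTF-8", while resolve_alias (table_with_ascii, "US-ASCII") yields "ASCII". The explicit entry has to come before the wildcard, since the first match wins.
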

So this is the third time this change has been proposed.
If you follow the previous thread:
http://lists.gnu.org/archive/html/bug-gnulib/2012-03/threads.html#00104
it refers to Bruno's argument for not changing this:
http://lists.gnu.org/archive/html/bug-gnulib/2012-01/msg00342.html

It's very unfortunate that US-ASCII doesn't reflect reality on Mac OS X.
I don't have such a system to test this myself, unfortunately.

cheers,
Pádraig.