[resending this email via another email, as there seems to have been a problem the first time around...?]
Hi again,

On 23.06.2012, at 18:36, Paul Eggert wrote:

> On 06/23/2012 07:54 AM, Paolo Bonzini wrote:
>> I'm waiting for feedback from the Gnulib guys.
>
> Can you please summarize the issue, the proposed fixes,
> and the pros and cons of each? The discussion has been
> spread out for so long that I've forgotten half of it.

Indeed... and now I have been silent for almost two weeks, too *sigh*.

> No need for anything fancy; URLs are fine. Thanks.

Sure! Let me start by quoting my email from June 6:

> 1) On Mac OS X, nl_langinfo(CODESET) returns "US-ASCII" if LC_ALL is set to C
> (see also
> <http://www.opensource.apple.com/source/Libc/Libc-763.13/locale/nl_langinfo-fbsd.c>)
>
> 2) On Mac OS X, localcharset.c uses a hard-coded charset.alias table, which
> does *not* contain a mapping for US-ASCII, but does contain a catch-all
> mapping that maps every unknown charset to UTF-8. Hence, US-ASCII is mapped to
> UTF-8.
>
> 3) Finally, on Mac OS X, MB_CUR_MAX is defined as follows:
> #define MB_CUR_MAX (___mb_cur_max())
> and that evaluates to 1 when LC_ALL is set to C, and more generally, when
> nl_langinfo(CODESET) returns "US-ASCII".
>
> At this point, GNU sed operates under the assumption that the encoding is UTF-8,
> but uses a MB_CUR_MAX value of 1, which then of course breaks when non-ASCII
> multibyte chars pop up.

Result: Certain commands fail, e.g. compare Apple/BSD sed with GNU sed
(which uses gnulib):

> $ echo "Rémi Leblond" | LC_ALL=C /usr/bin/sed -ne "s/.*/'&'/p"
> 'Rémi Leblond'
>
> But using GNU sed does not:
> $ echo "Rémi Leblond" | LC_ALL=C gsed -ne "s/.*/'&'/p"
> 'R'émi Leblond

My proposed fix: Stop trying to second-guess the OS. Just add a table entry
for "US-ASCII" to the hardcoded one in localcharset.c. This should work fine
on all Mac OS X 10.4 and newer. This will *not* break internationalization
for most users, as was claimed previously, because Apple's Terminal.app by
default sets LANG to reflect the active locale of the user.
The exception is if a user explicitly tells Terminal.app not to do that, or
manually sets LC_ALL; or if a script does so (such as certain parts of git,
which hence are broken when being used in conjunction with GNU sed on
Mac OS X -- ouch).

An alternative patch was suggested by Paul, which I confirmed to also work.
Personally, I find my solution more logical, but Paul certainly knows tons
more about these things than I do, and I'll happily defer to him. In the end
I don't care how this issue affecting people in real-life situations is
resolved, as long as it *is* resolved. Here is his proposed patch:

--- a/lib/localcharset.c
+++ b/lib/localcharset.c
@@ -542,5 +542,12 @@ locale_charset (void)
   if (codeset[0] == '\0')
     codeset = "ASCII";
 
+#ifdef DARWIN7
+  /* MacOS X sets MB_CUR_MAX to 1 when LC_ALL=C, and "UTF-8"
+     (the default codeset) does not work when MB_CUR_MAX is 1.  */
+  if (strcmp (codeset, "UTF-8") == 0 && MB_CUR_MAX <= 1)
+    codeset = "ASCII";
+#endif
+
   return codeset;
 }

Cheers,
Max
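
P.S. In case it helps to compare the two approaches side by side, here is a
rough sketch in C of the lookup as I described it above. It is a toy model,
not the actual gnulib code: the struct, the two tables and the names
resolve() and fixup() are made up for illustration, the tables contain only
the rows relevant here, and mapping US-ASCII to "ASCII" is my assumption
about what the new table entry would say.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one row of the hard-coded
   alias table; the real table has more rows and a different layout.  */
struct alias
{
  const char *from;
  const char *to;
};

/* The situation described above: no US-ASCII row, only the catch-all.  */
static const struct alias table_current[] = {
  { "*", "UTF-8" },
};

/* Same table plus the proposed row.  The target name "ASCII" is an
   assumption, chosen to match the name used in Paul's patch.  */
static const struct alias table_proposed[] = {
  { "US-ASCII", "ASCII" },
  { "*", "UTF-8" },
};

/* Return the first matching row's target; "*" matches any codeset.
   If nothing matches, the codeset is returned unchanged.  */
static const char *
resolve (const struct alias *table, size_t n, const char *codeset)
{
  size_t i;

  assert (codeset != NULL);
  for (i = 0; i < n; i++)
    if (strcmp (table[i].from, codeset) == 0
        || strcmp (table[i].from, "*") == 0)
      return table[i].to;
  return codeset;
}

/* The check from Paul's patch, with MB_CUR_MAX passed in as a
   parameter so that the logic can be tried out on any platform.  */
static const char *
fixup (const char *codeset, size_t mb_cur_max)
{
  if (strcmp (codeset, "UTF-8") == 0 && mb_cur_max <= 1)
    return "ASCII";
  return codeset;
}
```

With table_current, resolve() turns "US-ASCII" into "UTF-8", which is the
mismatch described above. With table_proposed it yields "ASCII" directly,
while fixup() arrives at the same answer after the fact whenever the
MB_CUR_MAX argument is 1.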