Dear all,

I subscribed to this list solely to respond to this thread, and hopefully to help iron out issues caused by installing GNU sed on Mac OS X (regressions in behavior compared to the system's sed). My interest stems from the fact that I am the maintainer of the GNU sed package in the Fink package management system (for reference, see <http://www.finkproject.org/> and <http://pdb.finkproject.org/pdb/package.php/sed>).

To recall, a simplified way to reproduce the issue is to set LC_ALL=C and then feed some UTF-8 data into sed. Using the sed shipped with Mac OS X, this works fine:

  $ echo "Rémi Leblond" | LC_ALL=C /usr/bin/sed -ne "s/.*/'&'/p"
  'Rémi Leblond'

But using GNU sed does not:

  $ echo "Rémi Leblond" | LC_ALL=C gsed -ne "s/.*/'&'/p"
  'R'émi Leblond

Let me sum up the reasons for this:

1) On Mac OS X, nl_langinfo(CODESET) returns "US-ASCII" if LC_ALL is set to C (see also <http://www.opensource.apple.com/source/Libc/Libc-763.13/locale/nl_langinfo-fbsd.c>).

2) On Mac OS X, localcharset.c uses a hard-coded charset.alias table, which does *not* contain a mapping for US-ASCII, but does contain a catch-all entry mapping every unknown charset to UTF-8. Hence, US-ASCII is mapped to UTF-8.

3) Finally, on Mac OS X, MB_CUR_MAX is defined as follows:

    #define MB_CUR_MAX (___mb_cur_max())

   and that evaluates to 1 when LC_ALL is set to C, and more generally whenever nl_langinfo(CODESET) returns "US-ASCII".

At this point, GNU sed operates under the assumption that the encoding is UTF-8, but uses an MB_CUR_MAX value of 1, which then of course breaks when non-ASCII multibyte characters pop up.

One way to fix this was already proposed: namely, to add a mapping from "US-ASCII" to "ASCII" to the built-in conversion table. This was rejected by Bruno Haible in <http://lists.gnu.org/archive/html/bug-gnulib/2012-01/msg00342.html> with the argument:

> Nah. "Let's break gettext() based internationalization of all GNU programs
> for most MacOS X users" won't get my approval.

Well, I strongly disagree with this statement. First off, to me, breaking i18n would actually be preferable to breaking shell scripts that run fine elsewhere or with Apple's sed. But as a matter of fact, I don't think that i18n would be broken at all (or if so, then only for a small minority of power users who know how to deal with it).

As far as I can tell, this breakage claim is based on an incorrect claim farther up in Bruno's email, to quote:

> [...] Therefore the normal situation on MacOS X is this:
> $ env | grep LC_
> $ locale
> LANG=
> LC_COLLATE="C"
> LC_CTYPE="C"
> LC_MESSAGES="C"
> LC_MONETARY="C"
> LC_NUMERIC="C"
> LC_TIME="C"
> LC_ALL=

That's not correct (though it might have been several years ago): Terminal.app on Mac OS X has an option "Set LANG environment variable", which is enabled by default. As a result, on my system (set to the German locale by default), for example, I get this:

  $ locale
  LANG="de_DE.UTF-8"
  LC_COLLATE="de_DE.UTF-8"
  LC_CTYPE="de_DE.UTF-8"
  LC_MESSAGES="de_DE.UTF-8"
  LC_MONETARY="de_DE.UTF-8"
  LC_NUMERIC="de_DE.UTF-8"
  LC_TIME="de_DE.UTF-8"
  LC_ALL=

Switching to French, I get the correct values, too (at least in freshly opened terminals).

There is another claim that appears to be incorrect to me (but perhaps I simply misunderstand it):

> There are several systems with locale encoding UTF-8 in the all user
> locales: Plan 9, BeOS, Haiku, MacOS X, Cygwin 1.7, and there will be more,
> because it's a natural choice nowadays. [...]

While Mac OS X defaults to UTF-8, it also supports other encodings in user locales. For example, the following German locales are supported:

  de_DE
  de_DE.ISO8859-1
  de_DE.ISO8859-15
  de_DE.UTF-8

Indeed, on the same Terminal.app preference page on which one can toggle the "Set LANG environment variable" setting, one can also choose another encoding.

In summary, I don't see any harm in adding the "US-ASCII" => "ASCII" mapping to the hard-coded charset.alias table. On the contrary, it resolves all issues with this code known to me; and always defaulting to UTF-8 doesn't seem sensible either, given that we cannot safely rely on the active encoding being UTF-8. So, not trying to outguess the OS seems to me to be the preferable route here...
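
To illustrate what the proposed entry changes, here is a minimal sketch of an alias lookup with a catch-all entry. It is a simplified Python model, not the actual C code in localcharset.c: the table contents and the names CURRENT_TABLE, PROPOSED_TABLE and canonical_charset are made up for this example, and it assumes that the first matching entry wins, so the explicit entry has to come before the catch-all.

```python
# Simplified model of a hard-coded charset.alias table with a catch-all entry.
# The entries are illustrative only; they are not copied from localcharset.c.
CATCH_ALL = "*"

# Model of the current situation: no entry for US-ASCII, catch-all at the end.
CURRENT_TABLE = [
    ("ISO8859-1", "ISO-8859-1"),
    ("ISO8859-15", "ISO-8859-15"),
    ("UTF-8", "UTF-8"),
    (CATCH_ALL, "UTF-8"),
]

# Model of the proposal: an explicit US-ASCII entry ahead of the catch-all.
PROPOSED_TABLE = [("US-ASCII", "ASCII")] + CURRENT_TABLE


def canonical_charset(codeset, table):
    """Return the canonical name for codeset; the first matching entry wins."""
    for alias, canonical in table:
        if alias == codeset or alias == CATCH_ALL:
            return canonical
    # Only reached if the table has no catch-all entry.
    return codeset


if __name__ == "__main__":
    for name in ("US-ASCII", "ISO8859-15", "UTF-8"):
        print(
            name,
            "-> current:", canonical_charset(name, CURRENT_TABLE),
            "| proposed:", canonical_charset(name, PROPOSED_TABLE),
        )
```

In this model only the result for "US-ASCII" differs between the two tables; every other codeset, including unknown ones that hit the catch-all, resolves exactly as before.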

And it seems to me that with this change, localization / internationalization in GNU apps would still work fine, at least under normal circumstances. Only "power users" who chose to disable the "Set LANG environment variable" setting will have to deal with the consequences; but they should be able to set their LANG / LC_ALL / etc. environment variables according to their needs.

But perhaps I am totally missing something -- in that case, I hope you can teach me about that, and perhaps we can come up with another way to improve the overall experience for the most important party involved here: the people using sed :-)

Cheers,
Max