Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8

Max Horn Wed, 06 Jun 2012 05:02:41 -0700

Dear all,

I subscribed to this list solely to respond to this thread, and hopefully help 
iron out issues caused by installing GNU sed on Mac OS X (regressions in 
behavior compared to the system's sed). My interest stems from the fact that I 
am the maintainer of the GNU sed package in the Fink package management system 
(for reference, see <http://www.finkproject.org/> and 
<http://pdb.finkproject.org/pdb/package.php/sed>).


To recall, a simplified way to reproduce the issue involves setting LC_ALL=C, 
and then feeding some UTF-8 data into sed. Using the sed shipped with Mac OS X, 
this works fine:

$ echo "Rémi Leblond" | LC_ALL=C /usr/bin/sed -ne "s/.*/'&'/p"
'Rémi Leblond'

But using GNU sed does not:
$ echo "Rémi Leblond" | LC_ALL=C gsed -ne "s/.*/'&'/p"
'R'émi Leblond
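
To rule out the terminal's own encoding as a factor, the same input can be spelled as explicit UTF-8 bytes. This is only a sketch: the helper name utf8_sample is made up for illustration, and gsed is assumed to be the name under which GNU sed is installed, as in the commands above.

```sh
# Hypothetical helper: emit "Rémi Leblond" with the é written as its two
# UTF-8 bytes (octal 303 251), independent of the terminal encoding.
utf8_sample() {
    printf 'R\303\251mi Leblond\n'
}

# 12 characters, but 13 bytes plus the trailing newline:
# the é occupies two bytes.
utf8_sample | wc -c

# Feed the same bytes to whichever sed is under test, for example:
#   utf8_sample | LC_ALL=C /usr/bin/sed -ne "s/.*/'&'/p"
#   utf8_sample | LC_ALL=C gsed -ne "s/.*/'&'/p"
```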


Let me sum up the reasons for this:

1) On Mac OS X, nl_langinfo(CODESET) returns "US-ASCII" if LC_ALL is set to C
  (see also 
<http://www.opensource.apple.com/source/Libc/Libc-763.13/locale/nl_langinfo-fbsd.c>)


2) On Mac OS X, localcharset.c uses a hard-coded charset.alias table, which 
does *not* contain a mapping for US-ASCII, but does contain a catch-all entry 
mapping every unknown charset to UTF-8. Hence, US-ASCII is mapped to UTF-8.

3) Finally, on Mac OS X, MB_CUR_MAX is defined as follows:
   #define      MB_CUR_MAX      (___mb_cur_max())
and that evaluates to 1 when LC_ALL is set to C, and more generally, when 
nl_langinfo(CODESET) returns "US-ASCII".

At this point, GNU sed operates under the assumption that the encoding is 
UTF-8, but uses an MB_CUR_MAX value of 1, which then of course breaks when 
non-ASCII multibyte chars pop up.
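
The lookup in step 2) can be sketched as a toy model in shell. This is not the gnulib code: the function name resolve_charset is hypothetical, and only the catch-all entry is taken from the description above; the explicit entries are merely illustrative.

```sh
# Toy model of the hard-coded charset.alias lookup described in 2).
# Hypothetical function; the explicit entries are illustrative only.
resolve_charset() {
    case "$1" in
        ISO8859-1)  echo "ISO-8859-1" ;;
        ISO8859-15) echo "ISO-8859-15" ;;
        UTF-8)      echo "UTF-8" ;;
        *)          echo "UTF-8" ;;   # catch-all: any unknown name becomes UTF-8
    esac
}

# Step 1) yields "US-ASCII" under LC_ALL=C. The table has no entry for
# that name, so the catch-all applies:
resolve_charset US-ASCII    # prints UTF-8
```

The codeset name a given system reports for the C locale should be visible with `LC_ALL=C locale charmap`.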

One way to fix this was already proposed: namely, to add a mapping from 
"US-ASCII" to "ASCII" to the built-in conversion table. This was rejected by 
Bruno Haible in
 <http://lists.gnu.org/archive/html/bug-gnulib/2012-01/msg00342.html> with the 
argument:

> Nah. "Let's break gettext() based internationalization of all GNU programs
> for most MacOS X users" won't get my approval.

Well, I strongly disagree with this statement. First off, to me, breaking i18n 
would actually be preferable over breaking shell scripts that run fine 
elsewhere or with Apple's sed. But as a matter of fact, I don't think that i18n 
would be broken at all (or if so, then only for a small minority of power users 
who know how to deal with it). As far as I can tell, this breakage claim is 
based on an incorrect claim further up in Bruno's email, to quote:

> [...] Therefore the normal situation on MacOS X is this:
>       $ env | grep LC_
>       $ locale
>       LANG=
>       LC_COLLATE="C"
>       LC_CTYPE="C"
>       LC_MESSAGES="C"
>       LC_MONETARY="C"
>       LC_NUMERIC="C"
>       LC_TIME="C"
>       LC_ALL=

That's not correct (though it might have been several years ago), because 
Terminal.app on Mac OS X has an option "Set LANG environment variable" which is 
enabled by default. As a result, on my system (set to German locale by 
default), for example, I get this:

$ locale
LANG="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_CTYPE="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_ALL=

Switching to French, I get the corresponding values, too (at least in 
freshly opened terminals). 


There is another claim that appears to be incorrect to me (but perhaps I simply 
misunderstand it):

> There are several systems with locale encoding UTF-8 in the all user
> locales: Plan 9, BeOS, Haiku, MacOS X, Cygwin 1.7, and there will be more,
> because it's a natural choice nowadays. [...]

While Mac OS X defaults to UTF-8, it also supports other encodings in user 
locales. For example, the following German locales are supported:
  de_DE
  de_DE.ISO8859-1
  de_DE.ISO8859-15
  de_DE.UTF-8


Indeed, on the same Terminal.app preference page on which one can toggle the 
"Set LANG environment variable" setting, one can also choose another encoding.

In summary, I don't see any harm in adding the "US-ASCII" => "ASCII" mapping 
to the hardcoded charset.alias table. On the contrary, it resolves all issues 
with this code known to me; and always defaulting to UTF-8 doesn't seem 
sensible either, in light of the fact that we cannot safely rely on the active 
encoding being UTF-8. So, not trying to outguess the OS seems to me to be the 
preferable route here...
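
In terms of a toy shell model of the table lookup, the proposal amounts to one extra entry ahead of the catch-all. The function name below is hypothetical, and the real table lives in localcharset.c as C code; this is only meant to show the intended effect.

```sh
# Toy model of the lookup with the proposed entry added.
# Hypothetical function name; not the actual gnulib code.
resolve_charset_patched() {
    case "$1" in
        US-ASCII) echo "ASCII" ;;   # proposed new mapping
        *)        echo "UTF-8" ;;   # existing catch-all, unchanged
    esac
}

resolve_charset_patched US-ASCII               # prints ASCII
resolve_charset_patched SOME-UNKNOWN-CHARSET   # still prints UTF-8
```

With such an entry, the charset reported for LC_ALL=C would again be consistent with an MB_CUR_MAX of 1.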

And it seems to me that with this change, localization / internationalization 
in GNU apps would still work fine, at least under normal circumstances. Only 
"power users" who choose to disable the "Set LANG environment variable" option 
will have to deal with the consequences; but they should be able to set their 
LANG / LC_ALL / etc. env variables according to their needs.

But perhaps I am totally missing out on something -- in that case, I hope you 
can teach me about that, and perhaps we can come up with another way to improve 
the overall experience for the most important party involved here: The people 
using sed :-)


Cheers,
Max
