Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8

Max Horn Sun, 10 Jun 2012 15:31:17 -0700

Hi again,


On 07.06.2012 at 14:07, Bruno Haible wrote:

[...]

> 
>> But this is dangerous, because now UTF-8 is set but MB_CUR_MAX is 1
>> and various parts of sed interpret "Rémi Leblond" as an invalid
>> character sequence for a UTF-8 character set.
> 
> Indeed, I can see how this inconsistency leads to bugs like the described
> ones.
> 
> The fix could be to have two different locale_charset() functions,
> one that returns "US-ASCII" and another one that returns "UTF-8".
> The first one to be used when MB_CUR_MAX and mbrtowc() are used as
> well, the second one to be used by gettext(). But the separation
> line between the two cases is not yet clear to me. Any insights?

Hmm, that sounds quite complicated. Could you explain what this would gain
over simply mapping "US-ASCII" to "ASCII", or over the patch Paul
suggested:

> --- a/lib/localcharset.c
> +++ b/lib/localcharset.c
> @@ -542,5 +542,12 @@ locale_charset (void)
>   if (codeset[0] == '\0')
>     codeset = "ASCII";
> 
> +#ifdef DARWIN7
> +  /* MacOS X sets MB_CUR_MAX to 1 when LC_ALL=C, and "UTF-8"
> +     (the default codeset) does not work when MB_CUR_MAX is 1.  */
> +  if (strcmp (codeset, "UTF-8") == 0 && MB_CUR_MAX <= 1)
> +    codeset = "ASCII";
> +#endif
> +
>   return codeset;
> }
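
For illustration, the check in that patch can be pulled out into a small
standalone function. This is only a rough sketch: effective_codeset() and
current_effective_codeset() are hypothetical names, not gnulib functions,
and the second one queries nl_langinfo (CODESET) directly, which is an
assumption and not necessarily how locale_charset() obtains the codeset on
OS X. It also drops the DARWIN7 guard so it builds anywhere.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of gnulib: given the codeset name that
   was looked up and the current MB_CUR_MAX, return a codeset name that
   is consistent with single-byte processing.  Mirrors the check in the
   quoted patch, without the DARWIN7 guard.  */
const char *
effective_codeset (const char *codeset, size_t mb_cur_max)
{
  /* An empty or missing name falls back to "ASCII", as in the context
     lines of the patch.  */
  if (codeset == NULL || codeset[0] == '\0')
    return "ASCII";

  /* "UTF-8" is contradictory when every character is one byte.  */
  if (strcmp (codeset, "UTF-8") == 0 && mb_cur_max <= 1)
    return "ASCII";

  return codeset;
}

/* Assumption: ask the C library directly for the codeset of the
   current locale.  The caller must have called setlocale() first.  */
const char *
current_effective_codeset (void)
{
  return effective_codeset (nl_langinfo (CODESET), MB_CUR_MAX);
}
```

A caller would run setlocale (LC_ALL, "") once at startup and then use
current_effective_codeset() wherever it would otherwise trust the raw
codeset name.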


Cheers,
Max
