Re: MirBSD mbtowc bug? failure on test-wcrtomb

Thorsten Glaser Sat, 23 Oct 2010 09:06:19 -0700

Bruno Haible dixit:

>Thorsten Glaser wrote:
>> Any call to setlocale() in MirBSD is a nop anyway¹.
>
>Is that true? Do you mean, the programs
[…]
>print en_US.UTF-8 and not C or POSIX?


Hrm. No, both print C, I mis-remembered.
• https://www.mirbsd.org/cvs.cgi/src/lib/libc/i18n/charsets.c?rev=1.17
• https://www.mirbsd.org/cvs.cgi/src/lib/libc/i18n/langinfo.c?rev=1.16

Looking at the first (setlocale), LC_CTYPE and LC_ALL will be
"en_US.UTF-8" and all others will be "C". The second is nl_langinfo.

>In that case, programs cannot even distinguish the C locale from other
>locales!

.oO(In MirBSD, there is exactly one locale…)
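
For illustration only: the setlocale() behaviour described above can be
inspected with a small helper like the one below. This is a minimal
sketch, not code from MirBSD libc or gnulib; the function names
locale_category_is and print_locale_report are hypothetical. It relies
only on the standard rule that setlocale(category, NULL) queries the
current setting without changing it.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the current setting of `category` is named `expected`.
 * (Hypothetical helper name, chosen for this sketch.) */
int locale_category_is(int category, const char *expected)
{
    /* A NULL second argument queries the locale without modifying it. */
    const char *current = setlocale(category, NULL);
    return current != NULL && strcmp(current, expected) == 0;
}

/* Print the current name of a few standard categories to `out`. */
void print_locale_report(FILE *out)
{
    static const struct { int id; const char *name; } cats[] = {
        { LC_ALL, "LC_ALL" },         { LC_CTYPE, "LC_CTYPE" },
        { LC_COLLATE, "LC_COLLATE" }, { LC_NUMERIC, "LC_NUMERIC" },
        { LC_TIME, "LC_TIME" },       { LC_MONETARY, "LC_MONETARY" },
    };
    size_t i;

    for (i = 0; i < sizeof cats / sizeof cats[0]; i++) {
        const char *value = setlocale(cats[i].id, NULL);
        fprintf(out, "%-12s %s\n", cats[i].name, value ? value : "(null)");
    }
}
```

Calling setlocale(LC_ALL, "") and then print_locale_report(stdout) from
main() shows which names the implementation actually reports for each
category.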

>Fortunately GNU gettext already has a workaround against this.

OK.

>> from what I gathered, back then, other implementations also fall back,
>> although, admittedly, to the "C" locale.
>
>Only OpenBSD and possibly Cygwin do. All other systems leave the locale

Mh. OpenBSD heritage, in this case. When I wrote this, I didn’t look up
POSIX/SUSv3, other than to see which functions, cpp defines, etc. should
be “somehow” implemented, to get UTF-8 working. Retrospectively, this is
not good, but back then, I didn’t really know better.

>This is precisely what causes the trouble: MirBSD violates POSIX => It
>causes porting trouble to the application writers.

Yes, although it implements more of POSIX than OpenBSD does, in this
respect. (Which will also cause problems because, if things are there,
application writers will probably expect them to behave in a
POSIX-conformant way.)

There is no good “way out” at this point ☹ I must admit we don’t want
_full_ POSIX conformance (one of its MUST points is actually illegal
in Germany, but that’s beside the point), but having it largely
covered is still desirable (especially for things that carry outside of
the BSD base, such as mksh).

>It could be understandable to "canonicalize" en_GB to en_US. But canonicalizing
>ja_JP to en_US is far-fetched.
>
>Do you also "canonicalize" en_US.ISO8859-1 to en_US.UTF-8? This would be
>an even bigger bug, because the UTF-8 encoding is not the same nor an extension
>of the requested ISO-8859-1 encoding.

In MirBSD, there is exactly one encoding, called “OPTU-8” internally,
which looks like UTF-8 or CESU-8 (doesn’t matter as wchar_t is 16 bit)
to the application but is 8-bit transparent over wide char conversion.

You see, for our situation, this makes sense, although very twisted.
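
For illustration only: the remark that UTF-8 versus CESU-8 "doesn't
matter" with a 16-bit wchar_t follows from the fact that the two
encodings differ only for code points above U+FFFF, which a single
16-bit wide character cannot hold. The sketch below is generic
textbook encoding, not MirBSD's code; the name bmp_to_utf8 is
hypothetical, and it makes no attempt to model the 8-bit transparency
of OPTU-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one 16-bit unit as a UTF-8 style byte sequence.
 * Returns the number of bytes written to `out` (1 to 3).
 * Every 16-bit value is treated alike; surrogates get no
 * special handling in this sketch. */
size_t bmp_to_utf8(uint16_t wc, unsigned char out[3])
{
    if (wc < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)wc;
        return 1;
    }
    if (wc < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (wc >> 6));
        out[1] = (unsigned char)(0x80 | (wc & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (wc >> 12));
    out[1] = (unsigned char)(0x80 | ((wc >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (wc & 0x3F));
    return 3;
}
```

A 16-bit input never needs more than three bytes, so the four-byte
UTF-8 form, the only place where UTF-8 and CESU-8 disagree, cannot be
produced from a single wide character.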

https://www.mirbsd.org/cvs.cgi/src/lib/libc/i18n/charsets.c.diff?r1=1.2;r2=1.3
All this was implemented (rather than really designed; it was sort of
a WFM approach, including some reading though, and your libutf8
helped) over five years ago, under a lot of Club-Mate influence…

>2010-10-23  Bruno Haible  <br...@clisp.org>
>
>       Tests: Fix LOCALE_JA on MirBSD 10.
>       * m4/locale-ja.m4 (gt_LOCALE_JA): Reject a locale identifier that leads
>       to an UTF-8 locale.
>       * m4/locale-fr.m4 (gt_LOCALE_FR): Likewise.
>       * m4/locale-zh.m4 (gt_LOCALE_ZH_CN): Likewise.
>       Reported by Eric Blake.

Looks good to me. Sorry for your trouble.
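
For illustration only: the idea in the ChangeLog entry, rejecting a
locale name that ends up selecting UTF-8, can be sketched in C as
below. This is an assumption about the general technique, not the
contents of the m4 macros named above; codeset_is_utf8 and
locale_usable_as_non_utf8 are hypothetical names. It uses only
setlocale(), nl_langinfo(CODESET) and strcasecmp() from POSIX.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>
#include <strings.h>

/* Return 1 if `codeset` names UTF-8 (also accepting the spelling
 * "UTF8"), compared case-insensitively. */
int codeset_is_utf8(const char *codeset)
{
    return codeset != NULL
        && (strcasecmp(codeset, "UTF-8") == 0
            || strcasecmp(codeset, "UTF8") == 0);
}

/* Return 1 if setlocale() accepts `name` for LC_CTYPE and the
 * resulting codeset is not UTF-8. The previous LC_CTYPE setting
 * is restored before returning. */
int locale_usable_as_non_utf8(const char *name)
{
    char saved[256];
    const char *old = setlocale(LC_CTYPE, NULL);
    int ok;

    if (old == NULL || strlen(old) >= sizeof saved)
        return 0;
    strcpy(saved, old);   /* later setlocale() calls may overwrite `old` */

    ok = setlocale(LC_CTYPE, name) != NULL
        && !codeset_is_utf8(nl_langinfo(CODESET));

    setlocale(LC_CTYPE, saved);
    return ok;
}
```

On a system that maps every locale name to a UTF-8 locale, a check of
this kind would return 0 for a name such as ja_JP.EUC-JP, so a test
that needs a genuine EUC-JP locale could be skipped instead of failing.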

bye,
//mirabilos
-- 
  "Using Lynx is like wearing a really good pair of shades: cuts out
   the glare and harmful UV (ultra-vanity), and you feel so-o-o COOL."
                                         -- Henry Nelson, March 1999
