Bruno Haible dixit:

>Thorsten Glaser wrote:
>> Any call to setlocale() in MirBSD is a nop anyway¹.
>
>Is that true? Do you mean, the programs […]
>print en_US.UTF-8 and not C or POSIX?

Hrm. No, both print C, I mis-remembered.

• https://www.mirbsd.org/cvs.cgi/src/lib/libc/i18n/charsets.c?rev=1.17
• https://www.mirbsd.org/cvs.cgi/src/lib/libc/i18n/langinfo.c?rev=1.16

Looking at the first (setlocale), LC_CTYPE and LC_ALL will be
"en_US.UTF-8" and all others will be "C". The second is nl_langinfo.

>In that case, programs cannot even distinguish the C locale from other
>locales!

.oO(In MirBSD, there is exactly one locale…)

>Fortunately GNU gettext already has a workaround against this.

OK.

>> from what I gathered, back then, other implementations also fall back,
>> although, admittedly, to the "C" locale.
>
>Only OpenBSD and possibly Cygwin do. All other systems leave the locale

Mh. OpenBSD heritage, in this case. When I wrote this, I didn’t look
up POSIX/SUSv3, other than to see which functions, cpp defines, etc.
should be “somehow” implemented, to get UTF-8 working. Retrospectively,
this is not good, but back then, I didn’t really know better.

>This is precisely what causes the trouble: MirBSD violates POSIX => It
>causes porting trouble to the application writers.

Yes, although it implements more of POSIX than OpenBSD does, in this
respect. (Which will also cause problems because, if things are there,
application writers will probably expect them to behave POSIX
conformantly.)

There is no good “way out” at this point ☹ I must admit we don’t want
_full_ POSIX conformance (one of its MUST points is actually illegal in
Germany, but that’s beside the point), but having it largely covered
is still desirable (especially for things that carry outside of the
BSD base, such as mksh).

>It could be understandable to "canonicalize" en_GB to en_US. But canonicalizing
>ja_JP to en_US is far-fetched.
>
>Do you also "canonicalize" en_US.ISO8859-1 to en_US.UTF-8? This would be
>an even bigger bug, because the UTF-8 encoding is not the same nor an extension
>of the requested ISO-8859-1 encoding.

In MirBSD, there is exactly one encoding, called “OPTU-8” internally,
which looks like UTF-8 or CESU-8 (doesn’t matter as wchar_t is 16 bit)
to the application but is 8-bit transparent over wide char conversion.
You see, for our situation, this makes sense, although very twisted.

https://www.mirbsd.org/cvs.cgi/src/lib/libc/i18n/charsets.c.diff?r1=1.2;r2=1.3

All this was implemented (more that than really designed, it was sort
of a WFM approach, including some reading though – and your libutf8
helped) over five years ago, under a lot of Club-Mate influence…

>2010-10-23  Bruno Haible  <br...@clisp.org>
>
>    Tests: Fix LOCALE_JA on MirBSD 10.
>    * m4/locale-ja.m4 (gt_LOCALE_JA): Reject a locale identifier that leads
>    to an UTF-8 locale.
>    * m4/locale-fr.m4 (gt_LOCALE_FR): Likewise.
>    * m4/locale-zh.m4 (gt_LOCALE_ZH_CN): Likewise.
>    Reported by Eric Blake.

Looks good to me. Sorry for your trouble.

bye,
//mirabilos
-- 
"Using Lynx is like wearing a really good pair of shades: cuts out
 the glare and harmful UV (ultra-vanity), and you feel so-o-o COOL."
                                         -- Henry Nelson, March 1999