On Mon, Apr 06, 2009 at 11:09:17AM -0700, Steve Langasek wrote:
> On Mon, Apr 06, 2009 at 05:33:35PM +0000, Thorsten Glaser wrote:
> > > If you need a specific locale (as seems from "mksh", not
> > > sure if it is a bug in that program), you need to set it.

> > You can only set a locale on a glibc-based system if it’s
> > installed beforehand, which root needs to do.

> You can build-depend on the locales package and generate the locales you
> want locally, using LOCPATH to reference them.  There's no need for Debian
> to guarantee the presence of a particular locale ahead of time -
> particularly one that isn't actually useful to end users, as C.UTF-8 would
> be.
I think it would be very useful; I'll detail why below.

The GCC toolchain has, for some time now, been using UTF-8 as the internal
representation for narrow strings (-fexec-charset).  It has also been using
UTF-8 as the default input encoding for C source code (-finput-charset).
This means that unless you take special measures, your program will be
outputting UTF-8 strings for all file and terminal I/O.  Of course, this is
backward compatible with ASCII, and is also transcoded automatically to the
locale codeset when running in a non-UTF-8 locale.  I've attached a trivial
example (a minimal sketch along the same lines appears below).  Just to be
clear: this handling is built into GCC and libc, and is completely
transparent.

Now, this works fine in all locales *except for C/POSIX*.  Obviously the
charsets of some locales can't represent all the characters used in this
example, but the C library will actually transcode (iconv) to the locale
codeset as best it can.  Except for C/POSIX.

Why is this needed?  If I write a program, I might want to use non-ASCII
UTF-8 characters in the sources.  We have been doing this for years without
realising it, ever since GCC switched to UTF-8 as the default internal
encoding; but purely for portability in the C locale we are restricted to
ASCII-only sources, and must then use a translation library such as
libintl/gettext to obtain translated strings containing the extended
characters.  This is workable, but it imposes a big burden on translators,
because I might want to use symbols and other characters which are not part
of a /language/ translation, yet need adding by each and every translator
through explicit translator comments in the sources.  This is tedious and
error-prone.

If the sources were UTF-8 encoded, this would work perfectly: I could just
use the necessary UTF-8 characters directly in the source rather than
abusing the translation machinery to avoid non-ASCII codes.  A UTF-8 C
locale thus cuts out a big pile of cruft and complexity in sources which
only exists to cater for people who want to run your code in a C locale!
And translators can drop the no-longer-needed job of "translating" special
characters alongside the actual translation work, so the symbol usage is
identical in all translations and their job is much easier.

I've tested all this, and it works *perfectly*.  Except that if you do
this, your program will not run in the C locale (and *only* the C locale)
due to having completely borked output.  A C.UTF-8 locale would be a
solution to this problem, and would allow full use of the *existing* UTF-8
string handling which all sources are built with, yet only a tiny fraction
dare to use.  Note that gettext is *completely disabled* if used in a C
locale, and this does additional mangling on top of the plain libc damage,
resulting in *no output at all*!  (I would need to double-check that; it
was the case when I last looked, and the reason I had to abandon use of
UTF-8 string literals.)

There are other uses for a UTF-8 C locale as well.  I've needed a UTF-8
locale at build time on several occasions for various tasks, mainly related
to translation work.  While you mentioned it's possible to do this by
generating locales at build time, in practice I've found this rather
error-prone and unreliable.  Having the C locale (which is the locale all
our buildds use by default) be UTF-8 would make these jobs much easier.
Some of the projects I work on, such as gutenprint, have needed to
reimplement some of the gettext internals to work around this in a portable
manner.
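To make this concrete, here is a minimal sketch along the lines of the
attached example (the attachment itself is not reproduced here; the file
name, build command and strings are purely illustrative).  It uses a wide
string literal, which libc converts to the locale codeset on output:

  /* utf8-demo.c -- illustrative only, not the actual attachment.
     Build with something like: gcc -std=c99 -o utf8-demo utf8-demo.c */
  #include <locale.h>
  #include <stdio.h>
  #include <wchar.h>

  int
  main (void)
  {
    /* Select the locale from the environment.  Under any UTF-8 locale
       the characters below print correctly; under LANG=C the POSIX
       codeset cannot represent them and the output is truncated. */
    setlocale (LC_ALL, "");

    /* The source file is UTF-8 (GCC's default input charset); the wide
       string is converted to the locale codeset when written out. */
    wprintf (L"Printed area: 210 × 297 mm (A4), margin ±2 mm\n");

    return 0;
  }

Run it with LANG=en_GB.UTF-8, then with LANG=de_DE@euro (Latin-1, where ×
and ± still exist), then with LANG=C: only the last one breaks.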
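And a second sketch contrasting today's translator-burden workaround with
what UTF-8 sources would allow.  The text domain "demo", the msgid and the
numbers are made up for this illustration:

  #include <libintl.h>
  #include <locale.h>
  #include <stdio.h>

  #define _(s) gettext (s)

  int
  main (void)
  {
    setlocale (LC_ALL, "");
    textdomain ("demo");   /* made-up domain name for this sketch */

    /* Today: keep the msgid ASCII-only, and ask every translator -- via
       an explicit translator comment -- to re-enter the real character
       in every single .po file. */
    /* TRANSLATORS: please use the real multiplication sign here. */
    printf (_("%d x %d mm\n"), 210, 297);

    /* With UTF-8 sources and a UTF-8-capable C locale: write the
       character once, directly, with no translation entry at all. */
    printf ("%d × %d mm\n", 210, 297);

    return 0;
  }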
Regarding the standards conformance of using a UTF-8 C locale: I've spent
some time reading the standards (SUSv3), and see no reason why C can't use
UTF-8 as its default codeset and still remain strictly conforming.  The
standard specifies a minimum requirement of a portable character set and
control character set.  This is satisfied by the 7-bit ASCII encoding which
we currently use as the C0 control and G0 graphics sets.  However, UTF-8 is
a strict superset of ASCII, and it is eminently reasonable to use UTF-8
*and still remain conforming* with the minimum functionality required by
the standard.  It's explicitly spelled out in SUSv2, though the wording was
dropped in SUSv3 (it's definitely not forbidden, though).

POSIX/C locale:
  http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap07.html#tag_07_02
Portable charset:
  http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap06.html#tag_06

  "Implementations may also add other characters."

This is from the charset documentation in SUSv2:
  http://opengroup.org/onlinepubs/007908775/xbd/charset.html

UTF-8 is the default character set on Debian GNU/Linux.  It's what we all
use, it's what all the tools use, and the C locale is the last ASCII
holdout.  It would make the lives of many maintainers and users more
bearable if it were also UTF-8, as well as getting rid of the current buggy
behaviour when you use UTF-8-encoded sources.  It's currently *the only
blocker* preventing us from using UTF-8-encoded sources.

Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.