Paul Eggert wrote:
> the GNU gettext manual says:
>
> >  Note that the MSGID argument to `gettext' is not subject to
> >  character set conversion.  Also, when `gettext' does not find a
> >  translation for MSGID, it returns MSGID unchanged - independently of
> >  the current output character set.  It is therefore recommended that all
> >  MSGIDs be US-ASCII strings.

This recommendation is directed at the "normal" use of xgettext, i.e.
extraction of the msgids from source code. The other issue - not mentioned
in the GNU gettext manual, but quite important - is that source code should
be viewable in different encodings, and when you convert some source code
from ISO-8859-1 to UTF-8 (or vice versa), the behaviour of the program
should remain the same.

The situation for iso-codes is different, because

  - The msgids are not extracted from source code; the use of XML files
    for the list of country/location names greatly reduces the possible
    problems when these files are stored in a different encoding (thanks
    to the encoding declaration present in XML files).

  - There are quite a number of languages/countries/locations in the
    world whose names cannot be written in ASCII, such as Norwegian
    Bokmål, Côte d'Ivoire, etc.

Therefore I think it's actually OK for iso-codes to use UTF-8 as the
encoding of the msgids.

The only remaining problem is in the C code: A program running in, say,
an EUC-JP locale needs to be a little careful when accessing the message
catalog: not just

    country_translation = dgettext ("iso-codes", country_english_utf8);

but

    country_translation = dgettext ("iso-codes", country_english_utf8);
    if (country_translation == country_english_utf8)
      {
        /* Not found in the message catalog. Use the English name,
           converted to the correct encoding.  */
        country_translation =
          iconv_string (country_translation, "UTF-8", locale_charset ());
      }

You can find code that is a little better than this (it takes care of
transliteration, a non-canonicalized locale_charset() result, etc.) in
propername.c at
http://cvs.savannah.gnu.org/viewcvs/*checkout*/gettext/gettext-tools/lib/propername.c?content-type=text%2Fplain&rev=1.1&root=gettext

In other words, UTF-8 is the current de-facto standard encoding. I would
leave the iso-codes PO files in that encoding, and keep the support for
other encodings purely in the C code that uses the .mo files.

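For illustration, here is a minimal sketch of what a helper in the role of
iconv_string above might look like when written directly against iconv(3).
The names utf8_to_charset and open_from_utf8 are hypothetical and are not
taken from gettext or propername.c. The sketch assumes an iconv whose input
argument is declared as char **, and it tries the "//TRANSLIT" suffix (an
extension of glibc and GNU libiconv) before falling back to a plain
conversion.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: open a converter from UTF-8 to TO_CHARSET,
   preferring transliteration where the iconv implementation offers it.  */
static iconv_t
open_from_utf8 (const char *to_charset)
{
  static const char suffix[] = "//TRANSLIT";
  size_t n = strlen (to_charset);
  iconv_t cd = (iconv_t) -1;
  char *name = malloc (n + sizeof suffix);

  if (name != NULL)
    {
      memcpy (name, to_charset, n);
      memcpy (name + n, suffix, sizeof suffix);   /* copies the NUL too */
      cd = iconv_open (name, "UTF-8");
      free (name);
    }
  if (cd == (iconv_t) -1)
    cd = iconv_open (to_charset, "UTF-8");        /* plain fallback */
  return cd;
}

/* Hypothetical helper: convert the NUL-terminated UTF-8 string UTF8 to
   TO_CHARSET.  Returns a malloc()ed, NUL-terminated string, or NULL if the
   charset is unknown, the input cannot be converted, or memory runs out.
   Assumes TO_CHARSET uses a single NUL byte as terminator, which holds for
   the usual locale encodings.  */
static char *
utf8_to_charset (const char *utf8, const char *to_charset)
{
  assert (utf8 != NULL && to_charset != NULL);

  iconv_t cd = open_from_utf8 (to_charset);
  if (cd == (iconv_t) -1)
    return NULL;

  char *inptr = (char *) utf8;      /* iconv does not modify the input */
  size_t inleft = strlen (utf8);
  size_t outsize = inleft + 16;
  size_t used = 0;
  char *out = malloc (outsize);
  int flushing = 0;

  while (out != NULL)
    {
      char *outptr = out + used;
      size_t outleft = outsize - used - 1;        /* keep room for NUL */
      size_t res = flushing
                   ? iconv (cd, NULL, NULL, &outptr, &outleft)
                   : iconv (cd, &inptr, &inleft, &outptr, &outleft);
      used = (size_t) (outptr - out);

      if (res != (size_t) -1)
        {
          if (flushing)
            break;                  /* input and shift state both written */
          flushing = 1;             /* next pass emits any final shift sequence */
        }
      else if (errno == E2BIG)
        {
          /* Output buffer full: grow it and continue where we stopped.  */
          char *bigger = realloc (out, outsize *= 2);
          if (bigger == NULL)
            {
              free (out);
              out = NULL;
            }
          else
            out = bigger;
        }
      else
        {
          /* EILSEQ or EINVAL: the input cannot be converted.  */
          free (out);
          out = NULL;
        }
    }

  iconv_close (cd);
  if (out != NULL)
    out[used] = '\0';
  return out;
}
```

In the snippet above, a call such as utf8_to_charset (country_english_utf8,
locale_charset ()) could stand in for iconv_string; the caller then owns the
returned string and must free() it, and should keep the UTF-8 name if NULL
comes back.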
> Can the format of the XML country list be extended to contain two
> spellings, one in UTF-8, one ASCII-ized? Then the algorithm wouldn't
> need to transcode.

The transliteration in glibc and libiconv is good enough.

Bruno