Bug#330990: iso-codes .pot msgid strings contain non-ASCII characters

Bruno Haible Mon, 24 Apr 2006 14:52:31 -0700

Paul Eggert wrote:
> the GNU gettext manual says:
>
>       Note that the MSGID argument to `gettext' is not subject to
>    character set conversion.  Also, when `gettext' does not find a
>    translation for MSGID, it returns MSGID unchanged - independently of
>    the current output character set.  It is therefore recommended that all
>    MSGIDs be US-ASCII strings.


This recommendation is directed to the "normal" use of xgettext, i.e.
extraction of the msgids from source code. The other issue - not mentioned
in the GNU gettext manual, but quite important - is that source code should
be viewable in different encodings, and when you convert some source code
from ISO-8859-1 to UTF-8 (or vice versa), the behaviour of the program
should remain the same.

The situation for iso-codes is different, because
  - It is not extracted from source code; the use of XML files for the
    list of country/location names greatly reduces the problems that
    could arise if these files were stored in a different encoding
    (thanks to the encoding declaration present in XML files).
  - There are quite a number of languages/countries/locations in the world
    which cannot be written in ASCII, such as Norwegian Bokmål, Côte
    d'Ivoire, etc.

Therefore I think it's actually OK for iso-codes to use UTF-8 as encoding
of the msgids.

The only remaining problem is in the C code: A program running in, say, an
EUC-JP locale, needs to be a little careful when accessing the message
catalog: not just

      country_translation = dgettext ("iso-codes", country_english_utf8);

but

      country_translation = dgettext ("iso-codes", country_english_utf8);
      if (country_translation == country_english_utf8)
        {
          /* Not found in the message catalog. Use the English name, converted
             to the correct encoding.  */
          country_translation =
            iconv_string (country_translation, "UTF-8", locale_charset ());
        }

You find code that is a little better than this one (cares about
transliteration, non-canonicalized locale_charset() result etc.) in
propername.c at

  http://cvs.savannah.gnu.org/viewcvs/*checkout*/gettext/gettext-tools/lib/propername.c?content-type=text%2Fplain&rev=1.1&root=gettext
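
The iconv_string call in the snippet above is not spelled out. As a rough
sketch only (this is not the code from propername.c), such a helper could be
written on top of POSIX iconv as follows. The name utf8_to_charset, the
buffer sizing, and the "try //TRANSLIT first, then plain" strategy are my
assumptions for illustration; the //TRANSLIT suffix is understood by glibc
and libiconv but is not part of POSIX.

```c
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated UTF-8 string SRC to
   TO_CHARSET.  Returns a freshly malloc'ed string that the caller must
   free, or NULL if the conversion is not possible.  */
static char *
utf8_to_charset (const char *src, const char *to_charset)
{
  char name[128];
  iconv_t cd = (iconv_t) -1;

  /* Prefer transliteration where the iconv implementation offers it.  */
  if (strlen (to_charset) + sizeof "//TRANSLIT" <= sizeof name)
    {
      snprintf (name, sizeof name, "%s//TRANSLIT", to_charset);
      cd = iconv_open (name, "UTF-8");
    }
  /* Fall back to a plain conversion without the suffix.  */
  if (cd == (iconv_t) -1)
    cd = iconv_open (to_charset, "UTF-8");
  if (cd == (iconv_t) -1)
    return NULL;

  size_t inleft = strlen (src);
  /* Assumption: 4 bytes per input byte plus slack is enough for the
     usual locale encodings; a too-small buffer makes iconv fail with
     E2BIG, which is reported as NULL below.  */
  size_t outsize = 4 * inleft + 16;
  char *out = malloc (outsize);
  if (out == NULL)
    {
      iconv_close (cd);
      return NULL;
    }

  char *inptr = (char *) src;     /* iconv never modifies the input */
  char *outptr = out;
  size_t outleft = outsize - 1;   /* reserve room for the final NUL */

  /* The second call flushes any pending shift sequence.  */
  if (iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1
      || iconv (cd, NULL, NULL, &outptr, &outleft) == (size_t) -1)
    {
      free (out);
      iconv_close (cd);
      return NULL;
    }

  iconv_close (cd);
  *outptr = '\0';
  return out;
}
```

In the fallback branch one would then call something like
utf8_to_charset (country_english_utf8, locale_charset ()), or pass
nl_langinfo (CODESET) from <langinfo.h> where gnulib's locale_charset is
not available, and keep the untranslated UTF-8 string if NULL comes back.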

In other words, UTF-8 is the current de-facto standard encoding. I would leave
the iso-codes PO files in that encoding, and keep the support of other encodings
purely in the C code that uses the .mo files.

> Can the format of the XML country list be extended to contain two
> spellings, one in UTF-8, one ASCII-ized?  Then the algorithm wouldn't
> need to transcode.

The transliteration in glibc and libiconv is good enough.

Bruno
