[gentoo-dev] RFC: BCP 47 for L10N? (was: News item: LINGUAS USE_EXPAND renamed to L10N)

Ulrich Mueller Fri, 10 Jun 2016 02:30:07 -0700

>>>>> On Tue, 7 Jun 2016, Chí-Thanh Christopher Nguyễn wrote:


>> 4. According to Gettext documentation, "'@VARIANT' can denote any
>> kind of characteristics that is not already implied by the language
>> LL and the country CC." (So IIUC the BCP-47 variant "valencia"
>> would become "@valencia".)

> This I think is wrong and collides with POSIX.
> POSIX modifiers are not allowed for LANG or LC_ALL in
> POSIX.1-2008[1]. Section 8.2 says you can have at most one modifier
> field to "select a specific instance of localization data within a
> single category", which I don't think applies because it is its own
> locale, not an instance of an existing one. Furthermore (but that
> doesn't apply in our use case), POSIX spec lists the example
> LC_COLLATE=De_DE@dict
> So what if you want Catalan Valencian with dictionary order? Or if
> someone hypothetically came up with a different script?

>> I haven't found any mention or usage of ISO 3166-2 region
>> subdivisions in the context of locale. Can you provide any
>> references for this?

> As I wrote before, it is not used. But I think it is the only
> spec-compliant way to marry POSIX locales with Catalan Valencian.
> BCP-47 does it in a more natural way.

So, trying to summarise: We cannot follow strict POSIX syntax, so our
two choices are either to stick to Gettext LL_CC@VARIANT syntax or
to change to BCP 47.

Using BCP 47 would have some advantages:
- It is a well-defined standard [1], and tools for validating
  language tags exist, e.g. [2].
- The L10N USE_EXPAND could follow usual USE flag syntax, as BCP 47
  tags contain neither underscores (which are supposed to be reserved
  as USE_EXPAND separators) nor @ signs (which PMS explicitly
  mentions as an exception for LINGUAS).
- Gettext's @VARIANT is ill-defined and conflates different
  characteristics like script and variant. There is no further
  subdivision within @VARIANT, which leads to locale names like
  sr@ijekavianlatin. Also, different upstreams use different
  conventions, like @latin and @Latn for the Latin script.
- For the vast majority of languages, identifiers are either identical
  ("de" -> "de") or they can be converted by simple shell substitution
  ("pt-BR" -> "pt_BR").
- IIUC, L10N is primarily intended to control things like additional
  language bundles of packages. Some upstreams like libreoffice
  already use BCP 47 for these.
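
To illustrate the "simple substitution" case from the list above, here
is a rough Python sketch. The function name is made up for this example
and it does not reflect any existing eclass or Portage code. It handles
only plain language tags and language-region tags with a two-letter
region, and refuses anything else (scripts, variants), since those
cannot be converted mechanically.

```python
def bcp47_to_gettext_simple(tag: str) -> str:
    """Convert a plain BCP 47 tag ("de", "pt-BR") to Gettext LL[_CC] form.

    Hypothetical helper for illustration only. Tags carrying a script
    or variant subtag (e.g. "sr-Latn", "ca-valencia") are rejected.
    """
    parts = tag.split("-")
    if len(parts) == 1:
        # Language only: the identifier is identical in both syntaxes.
        return parts[0]
    region = parts[1]
    if len(parts) == 2 and len(region) == 2 and region.isascii() and region.isalpha():
        # Language plus two-letter region: replace the hyphen by an underscore.
        return f"{parts[0]}_{region.upper()}"
    raise ValueError(f"no simple conversion for {tag!r}")
```

For example, bcp47_to_gettext_simple("pt-BR") gives "pt_BR", while
"sr-Latn" raises ValueError.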

On the other hand, there will be some cost:
- If BCP 47 tags containing a script or a variant are to be used to
  generate LINGUAS, they will require an explicit mapping. (OTOH, such
  a mapping will also be needed if we stick to Gettext syntax but unify
  variants like "sr@latin" and "sr@Latn".)
- Different syntax for LINGUAS and L10N might be confusing to users,
  so additional documentation will be needed.
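
The explicit mapping mentioned in the first point could look roughly
like the following Python sketch. The table contents and the function
name are assumptions for illustration: the two entries are only the
examples that came up in this thread, and which Gettext spelling
("sr@latin" vs. "sr@Latn") a given package expects would still have to
be decided per upstream. Everything not in the table falls back to a
naive hyphen-to-underscore substitution.

```python
# Hypothetical table; entries are the examples from this thread only.
EXPLICIT_MAP = {
    "sr-Latn": "sr@latin",        # script subtag -> Gettext modifier
    "ca-valencia": "ca@valencia",  # variant subtag -> Gettext modifier
}


def l10n_to_linguas(tags):
    """Map a list of BCP 47 tags to Gettext-style LINGUAS entries."""
    result = []
    for tag in tags:
        if tag in EXPLICIT_MAP:
            # Script or variant: needs a hand-maintained mapping.
            result.append(EXPLICIT_MAP[tag])
        else:
            # Simple case: "de" stays "de", "pt-BR" becomes "pt_BR".
            result.append(tag.replace("-", "_"))
    return result
```

A real implementation would want to reject unknown tags with a script
or variant instead of passing them through the fallback.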

Comments?

Ulrich

[1] https://tools.ietf.org/html/bcp47
[2] http://schneegans.de/lv/
