user lintian-ma...@debian.org
usertag 1013946 + false-positive unknown-locale-code
tag 1013946 + confirmed
retitle 1013946 lintian: [FP] Wrongly reports unknown-locale-code "ber" (POSIX 
locales: ISO 639-2 vs 639-3 vs 639-5)
kthxbye

Hi Fabio,

Fabio Fantoni wrote:
> Package: lintian
> Version: 2.115.1
> Severity: normal
> 
> Hi, on a lintian output I saw:
> 
> W: xapps-common: unknown-locale-code ber [usr/share/locale/ber/]
> 
> but ber locale exists:
> https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?code_ID=54

Thanks for your bug report. This brought up a quite difficult question:
Which parts of ISO 639 are meant to be used for POSIX locales?


Summary / TL;DR
---------------

It is currently not clear whether this is really a false positive or a
true positive. It basically boils down to the question of which parts
of ISO 639 should be used for POSIX locales: ISO 639-2, 639-3, 639-5 or
a combination of these? Lintian currently uses only ISO 639-3, which
probably includes all of ISO 639-1 and most but not all of ISO 639-2.

And ISO 639-3 currently doesn't include "ber" (which is a
group of languages and not a language) but includes e.g. "jbe"
("Judeo-Berber"). ISO 639-2 and 639-5, though, do include "ber".

For locales, POSIX refers to ISO/IEC 15897. And that one refers to ISO
639, but not explicitly to any part of it.

I have concluded that this Lintian check should be expanded from using
only ISO 639-3 to also using ISO 639-2 (both of which also include ISO
639-1), which makes "ber" a locale accepted by Lintian.

For more detailed reasoning and the sources used, see below.


Long Story and Reasoning
------------------------

It seems as if Lintian only takes ISO 639-3 into account, not ISO
639-2. And https://iso639-3.sil.org/about says

  At the core of ISO 639-3 are the individual languages already
  accounted for in ISO 639-2. The large number of living languages in
  the initial inventory of ISO 639-3 beyond those already included in
  ISO 639-2 was derived primarily from […]

For me, it's currently not clear whether that means that all languages
in ISO 639-2 are literally included in ISO 639-3 (i.e. ISO 639-3 is a
superset of ISO 639-2) or whether ISO 639-3 is just an addition to
ISO 639-2 (i.e. the languages in ISO 639-2 and ISO 639-3 are
disjoint).

In the former case, this would be a bug in the package iso-codes (or
isoquery, depending on the data model; see below), in the latter case
this would be a bug in Lintian as it would need to take ISO 639-2 into
account here, too.

And "ber" has been in ISO 639-2 since 2009 according to
https://www.loc.gov/standards/iso639-2/php/code_changes_bycode.php?code_ID=54

And "isoquery" also finds it in ISO 639-2, but not ISO 639-3:

  → isoquery -i 639-3 ber
  isoquery: The code "ber" is not defined in ISO 639-3.
  → isoquery -i 639-2 ber
  ber                     Berber languages
  →

And indeed, the word "ber" can only be found in the ISO 639-2 and ISO
639-5 datasets:

  → fgrep -wA1 ber /usr/share/iso-codes/json/iso_639-?.json
  /usr/share/iso-codes/json/iso_639-2.json:      "alpha_3": "ber",
  /usr/share/iso-codes/json/iso_639-2.json-      "name": "Berber languages"
  --
  /usr/share/iso-codes/json/iso_639-5.json:      "alpha_3": "ber",
  /usr/share/iso-codes/json/iso_639-5.json-      "name": "Berber languages"
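
The same lookup can be scripted. This is a minimal sketch in Python
(not Lintian's Perl), assuming that each iso_<part>.json file has a
top-level key named after the part (as iso_639-3.json does in the
Lintian code quoted further down) and a list of entries with an
"alpha_3" field. The function names are made up for illustration.

```python
import json
from pathlib import Path

# Location used by Debian's iso-codes package (taken from the output above).
ISO_CODES_DIR = Path("/usr/share/iso-codes/json")


def load_alpha3(part, directory=ISO_CODES_DIR):
    """Return the set of alpha_3 codes in iso_<part>.json.

    Assumption: the top-level JSON key equals the part name, e.g. "639-2".
    A missing file yields an empty set instead of an error.
    """
    path = Path(directory) / f"iso_{part}.json"
    if not path.exists():
        return set()
    data = json.loads(path.read_text(encoding="utf-8"))
    return {entry["alpha_3"] for entry in data.get(part, []) if "alpha_3" in entry}


def parts_defining(code, parts=("639-2", "639-3", "639-5"), directory=ISO_CODES_DIR):
    """Return the ISO 639 parts (in the given order) whose dataset defines `code`."""
    return [part for part in parts if code in load_alpha3(part, directory)]


if __name__ == "__main__":
    # The result depends on the installed iso-codes version.
    print(parts_defining("ber"))
```

Only the "alpha_3" field is compared here, so aliases such as "ger"
would need separate handling.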

ISO 639-5 is also said to be a "supplement" according to
https://www.loc.gov/standards/iso639-5/

Then again there are three-letter codes for languages like "deu" (and
its alias "ger") for German which are in both ISO 639-2 and
ISO 639-3, but not in ISO 639-5. For me, this only adds to the
confusion.

The relevant file is lib/Lintian/Check/Files/Locales.pm, lines 69 to 90
as of today:

     69 has ISO639_3_by_alpha3 => (
     70     is => 'rw',
     71     lazy => 1,
     72     default => sub {
     73         my ($self) = @_;
     74 
     75         local $ENV{LC_ALL} = 'C';
     76 
     77         my $bytes = path('/usr/share/iso-codes/json/iso_639-3.json')->slurp;
     78         my $json = decode_json($bytes);
     79 
     80         my %iso639_3;
     81         for my $entry (@{$json->{'639-3'}}) {
     82 
     83             my $alpha_3 = $entry->{alpha_3};
     84 
     85             $iso639_3{$alpha_3} = $entry;
     86         }
     87 
     88         return \%iso639_3;
     89     }
     90 );

Lines 100 to 122 though give a hint that the author of this code
thinks that ISO 639-3 is just a union of ISO 639-1 and ISO 639-2:

    100         my %CODES;
    101         for my $entry (values %{$self->ISO639_3_by_alpha3}) {
                                               ^^^^^^^^
    102 
    103             my $type = $entry->{type};
    104 
    105             # https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=692548#10
    106             next
    107               if $type eq $RESERVED || $type eq $SPECIAL;
    108 
    109             # also have two letters, ISO 639-1
                                             ^^^^^^^^^
    110             my $two_letters;
    111             $two_letters = $entry->{alpha_2}
    112               if exists $entry->{alpha_2};
    113 
    114             $CODES{$two_letters} = $EMPTY
    115               if length $two_letters;
    116 
    117             # three letters, ISO 639-2
                                     ^^^^^^^^^
    118             my $three_letters = $entry->{alpha_3};
    119 
    120             # a value indicates that two letters are preferred
    121             $CODES{$three_letters} = $two_letters || $EMPTY;
    122         }

That assumption is clearly wrong: as shown above, not all
languages from ISO 639-2 are included in ISO 639-3, at least not in
the datasets as present in Debian's iso-codes package.

So this needs to be changed in Lintian anyway.

Then again, Wikipedia claims in https://en.wikipedia.org/wiki/ISO_639
that "Individual languages in Part 2 always have a code in Part 3
(only the Part 2 terminology code is reused there)". Another hint was
"Macrolanguages (Part 3)" which pointed to
https://en.wikipedia.org/wiki/ISO_639_macrolanguage

That page quotes the ISO as follows:

    Some existing code elements in ISO 639-2, and the corresponding
    code elements in ISO 639-1, are designated in those parts of ISO
    639 as individual language code elements, yet are in a one-to-many
    relationship with individual language code elements in [ISO
    639-3]. For purposes of [ISO 639-3], they are considered to be
    macrolanguage code elements.

    — ISO 639-3: Relationship between ISO 639-3 and the other parts of
      ISO 639

And indeed, "ber" in ISO 639-2 is not the "Berber language" but
"Berber languages" (i.e. plural). In ISO 639-3 there's only
"Judeo-Berber" with the three-letter code "jbe". See also
https://en.wikipedia.org/wiki/Berber_languages and
https://en.wikipedia.org/wiki/Judeo-Berber_language for this specific
case.

From my point of view, these macrolanguages or language groups likely
shouldn't be used in locales as according to
https://en.wikipedia.org/wiki/Locale_(computer_software), "a locale is
a set of parameters that defines the user's language, region and any
special variant preferences that the user wants to see in their user
interface". But I may be wrong…

It also says that "on POSIX platforms such as Unix, Linux and others,
locale identifiers are defined by ISO/IEC 15897".
https://www.open-std.org/jtc1/sc22/wg20/docs/n610.pdf, the free draft
of that standard, though, only refers to ISO 639 as a whole, not to any
of its parts. It also refers to "natural languages", but that just
separates the term "language" from artificial and constructed
languages.

Looking at the publication dates of ISO/IEC 15897 (according to
https://en.wikipedia.org/wiki/ISO/IEC_15897) and the ISO 639 parts
(according to
https://en.wikipedia.org/wiki/ISO_639#Current_and_historical_parts_of_the_standard),
we can assume, though, that at least the original publication of
ISO/IEC 15897 (1999) could only refer to ISO 639-1 ("1967 (as ISO
639)") and 639-2 (1998), as the later parts weren't published yet. The
second edition of ISO/IEC 15897 was published in 2011 and hence could
have been aware of ISO 639-3, which was first published in 2007, as
well as ISO 639-5, which was first published in 2008.

That "1967 (as ISO 639)" also could be a hint that only ISO 639-1 was
meant, but since ISO 639-1 only provides two-letter codes, it is clear
that this won't suffice and at least some three-letter codes need to
be accepted. The question remains: Which ISO 639 parts should be
included, and which possibly not?

https://en.wikipedia.org/wiki/ISO_639#Current_and_historical_parts_of_the_standard
also lists the number of languages included in each part:

  ISO 639-1 (two-letter only):  184
  ISO 639-2 (three-letter):     502
  ISO 639-3 (three-letter):    7893
  ISO 639-5 (three-letter):     115 (of which 29 are also in ISO 639-2)

It also seems that the union of ISO 639-3 and 639-5 is a superset of
ISO 639-2, but ISO 639-3 alone is not.
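
Whether one set of codes is covered by the union of others can be
checked mechanically. A small sketch in Python, where the helper name
uncovered_codes is made up and the sets at the bottom are tiny
illustrative samples, not the real datasets:

```python
def uncovered_codes(subset, *supersets):
    """Return the codes in `subset` that appear in none of `supersets`.

    An empty result means the union of `supersets` covers `subset`.
    """
    covered = set().union(*supersets)
    return set(subset) - covered


if __name__ == "__main__":
    # Illustrative samples only; real input would be the alpha_3 codes
    # read from the iso-codes JSON files.
    iso639_2 = {"ber", "deu"}
    iso639_3 = {"deu", "jbe"}
    iso639_5 = {"ber"}
    print(uncovered_codes(iso639_2, iso639_3))
    print(uncovered_codes(iso639_2, iso639_3, iso639_5))
```

Run against the full datasets, any code left over by the second call
would disprove the superset assumption.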


Conclusion
----------

So my current gut feeling and the reasoning behind it are as follows:

* The POSIX standard doesn't help us here (once again) because it is
  ambiguous (once again).

* ISO 639-1 and ISO 639-2 should be included because they're the most
  common ones, even though they seem to include some macrolanguages or
  language families.

* ISO 639-1 seems to be a subset of ISO 639-2 and iso-codes doesn't
  even include data files for ISO 639-1. So I consider using just
  /usr/share/iso-codes/json/iso_639-2.json as the source for ISO 639-1
  as well as ISO 639-2.

* ISO 639-3 covers most languages but neither macrolanguages nor
  language families and hence should be included, too.

* ISO 639-5 only includes language families and groups and hence
  should _not_ be included.
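
The policy in the list above can be sketched as follows. This is a
Python illustration, not the actual Lintian patch. It mirrors the
structure of the %CODES loop quoted earlier, but merges entries from
both ISO 639-2 and ISO 639-3 and leaves ISO 639-5 out. The function
name build_locale_codes and the skip_types parameter are hypothetical;
which "type" values Lintian treats as reserved or special is not shown
in the quoted code, so the caller has to supply them.

```python
def build_locale_codes(iso639_2_entries, iso639_3_entries, skip_types=frozenset()):
    """Merge ISO 639-2 and ISO 639-3 entries into one code table.

    Keys are accepted locale codes. For a three-letter code the value is
    the preferred two-letter code, or "" if there is none. Two-letter
    codes map to "". ISO 639-5 is deliberately not consulted.
    """
    codes = {}
    for entry in list(iso639_3_entries) + list(iso639_2_entries):
        # Entries without a "type" field (assumed for ISO 639-2) are kept.
        if entry.get("type") in skip_types:
            continue
        two_letters = entry.get("alpha_2", "")
        three_letters = entry.get("alpha_3", "")
        if two_letters:
            codes[two_letters] = ""
        if three_letters:
            # Keep a two-letter preference recorded by an earlier entry.
            codes[three_letters] = two_letters or codes.get(three_letters, "")
    return codes


if __name__ == "__main__":
    # Sample entries for illustration; real input comes from iso-codes.
    part2 = [{"alpha_3": "ber", "name": "Berber languages"}]
    part3 = [{"alpha_3": "jbe", "name": "Judeo-Berber", "type": "L"}]
    print(sorted(build_locale_codes(part2, part3)))
```

With this merge, both "ber" and "jbe" end up accepted, while codes
that exist only in ISO 639-5 still trigger the tag.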

If anyone has a different opinion on this topic, please speak up (and
preferably also explain why :-).

But actually there are only two other options which I consider to be
feasible:

* Keep ISO 639-3 as the only source for valid locales. (Which would make
  this issue a true positive.)

* Allow any (non-withdrawn) ISO 639 part as source for a valid locale
  name, i.e. use ISO 639-2, 639-3 and 639-5.

                Regards, Axel
-- 
 ,''`.  |  Axel Beckert <a...@debian.org>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE
