user lintian-ma...@debian.org
usertag 1013946 + false-positive unknown-locale-code
tag 1013946 + confirmed
retitle 1013946 lintian: [FP] Wrongly reports unknown-locale-code "ber" (POSIX locales: ISO 639-2 vs 639-3 vs 639-5)
kthxbye

Hi Fabio,

Fabio Fantoni wrote:
> Package: lintian
> Version: 2.115.1
> Severity: normal
>
> Hi, on a lintian output I saw:
>
> W: xapps-common: unknown-locale-code ber [usr/share/locale/ber/]
>
> but ber locale exists:
> https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?code_ID=54

thanks for your bug report. This brought up a quite difficult question:
Which parts of ISO 639 are meant to be used for POSIX locales?

Summary / TL;DR
---------------

It is currently not clear whether this is really a false positive or a
true positive. It basically boils down to the question of which parts of
ISO 639 should be used for POSIX locales: ISO 639-2, 639-3, 639-5 or a
combination of these?

Lintian currently uses only ISO 639-3, which probably includes all of
ISO 639-1 and most but not all of ISO 639-2. ISO 639-3 currently doesn't
include "ber" (which is a group of languages and not a language) but
includes e.g. "jbe" ("Judeo-Berber"). ISO 639-2 and 639-5, though, do
include "ber".

For locales, POSIX refers to ISO/IEC 15897. And that one refers to ISO
639, but not explicitly to any part of it.

I came to the conclusion to expand this lintian check from using only
ISO 639-3 to also using ISO 639-2 (both of which also include ISO 639-1)
and hence to make "ber" a locale accepted by lintian.

For more detailed reasoning and the sources used, see below.

Long Story and Reasoning
------------------------

It seems as if Lintian only takes ISO 639-3 into account, not ISO
639-2. And https://iso639-3.sil.org/about says

  At the core of ISO 639-3 are the individual languages already
  accounted for in ISO 639-2. The large number of living languages in
  the initial inventory of ISO 639-3 beyond those already included in
  ISO 639-2 was derived primarily from […]

For me, it's currently not clear if that means that all languages in
ISO 639-2 are literally included in ISO 639-3 (i.e. ISO 639-3 is a
superset of ISO 639-2) or if ISO 639-3 is just an addition to ISO 639-2
(i.e.
the languages in ISO 639-2 and ISO 639-3 are disjoint).

In the former case, this would be a bug in the package iso-codes (or
isoquery, depending on the data model; see below); in the latter case
this would be a bug in Lintian, as it would need to take ISO 639-2 into
account here, too.

And "ber" has been in ISO 639-2 since 2009 according to
https://www.loc.gov/standards/iso639-2/php/code_changes_bycode.php?code_ID=54

And "isoquery" also finds it in ISO 639-2, but not in ISO 639-3:

  → isoquery -i 639-3 ber
  isoquery: The code "ber" is not defined in ISO 639-3.
  → isoquery -i 639-2 ber
  ber Berber languages
  →

And indeed, the word "ber" can only be found in the ISO 639-2 and ISO
639-5 datasets:

  → fgrep -wA1 ber /usr/share/iso-codes/json/iso_639-?.json
  /usr/share/iso-codes/json/iso_639-2.json: "alpha_3": "ber",
  /usr/share/iso-codes/json/iso_639-2.json- "name": "Berber languages"
  --
  /usr/share/iso-codes/json/iso_639-5.json: "alpha_3": "ber",
  /usr/share/iso-codes/json/iso_639-5.json- "name": "Berber languages"

ISO 639-5 is also said to be a "supplement" according to
https://www.loc.gov/standards/iso639-5/

Then again, there are three-letter codes for languages like "deu" (and
its alias "ger") for German which are in both ISO 639-2 and ISO 639-3,
but not in ISO 639-5.

For me, this only adds to the confusion.
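
The superset question can also be checked directly against the
datasets instead of one code at a time. The following is only a
minimal sketch: it assumes the '"alpha_3": "xyz"' formatting seen in
the fgrep output above and a grep that supports -o (as GNU grep does);
alpha3_codes and missing_codes are made-up helper names, not part of
iso-codes or isoquery.

```shell
#!/bin/sh
# Print the alpha_3 codes of one iso-codes JSON file, one per line, sorted.
# Assumption: entries look like '"alpha_3": "xyz"' with a single space.
alpha3_codes() {
    grep -o '"alpha_3": "[a-z]*"' "$1" | cut -d '"' -f 4 | sort -u
}

# Print the codes defined in the first file but missing from the second.
# (Hypothetical helper; uses temporary files to stay POSIX sh compatible.)
missing_codes() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    alpha3_codes "$1" > "$tmp_a"
    alpha3_codes "$2" > "$tmp_b"
    # comm -23 keeps only the lines unique to the first sorted input.
    comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}

# Usage on a system with iso-codes installed:
#   dir=/usr/share/iso-codes/json
#   missing_codes "$dir/iso_639-2.json" "$dir/iso_639-3.json"
```

If ISO 639-3 were a superset of ISO 639-2 in the iso-codes data, the
call in the usage comment would print nothing; going by the isoquery
output above, "ber" should be among the lines it prints.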

The relevant file is lib/Lintian/Check/Files/Locales.pm, at lines 69 to
90 as of today:

   69  has ISO639_3_by_alpha3 => (
   70      is => 'rw',
   71      lazy => 1,
   72      default => sub {
   73          my ($self) = @_;
   74
   75          local $ENV{LC_ALL} = 'C';
   76
   77          my $bytes = path('/usr/share/iso-codes/json/iso_639-3.json')->slurp;
   78          my $json = decode_json($bytes);
   79
   80          my %iso639_3;
   81          for my $entry (@{$json->{'639-3'}}) {
   82
   83              my $alpha_3 = $entry->{alpha_3};
   84
   85              $iso639_3{$alpha_3} = $entry;
   86          }
   87
   88          return \%iso639_3;
   89      }
   90  );

Lines 100 to 122, though, give a hint that the author of this code
thinks that ISO 639-3 is just a union of ISO 639-1 and ISO 639-2:

  100          my %CODES;
  101          for my $entry (values %{$self->ISO639_3_by_alpha3}) {
                                              ^^^^^^^^
  102
  103              my $type = $entry->{type};
  104
  105              # https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=692548#10
  106              next
  107                if $type eq $RESERVED || $type eq $SPECIAL;
  108
  109              # also have two letters, ISO 639-1
                                            ^^^^^^^^^
  110              my $two_letters;
  111              $two_letters = $entry->{alpha_2}
  112                if exists $entry->{alpha_2};
  113
  114              $CODES{$two_letters} = $EMPTY
  115                if length $two_letters;
  116
  117              # three letters, ISO 639-2
                                    ^^^^^^^^^
  118              my $three_letters = $entry->{alpha_3};
  119
  120              # a value indicates that two letters are preferred
  121              $CODES{$three_letters} = $two_letters || $EMPTY;
  122          }

This assumption is clearly wrong, as it has been shown above that not
all languages from ISO 639-2 are included in ISO 639-3, at least not in
the datasets as present in Debian's iso-codes package.

So this needs to be changed in Lintian anyway.

Then again, Wikipedia claims in https://en.wikipedia.org/wiki/ISO_639
that "Individual languages in Part 2 always have a code in Part 3 (only
the Part 2 terminology code is reused there)".
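
Independent of how the Perl attribute ends up being changed, the
lookup proposed in the summary (accept a code if ISO 639-2 or ISO
639-3 defines it) can be tried from the shell against the same JSON
files. This is only a sketch of the intended behaviour, not Lintian's
implementation: code_in_file and locale_code_known are hypothetical
names, the '"alpha_2": "xy"' formatting is assumed to match that of
"alpha_3" shown earlier, and the $RESERVED/$SPECIAL filtering from the
Perl code is not reproduced.

```shell
#!/bin/sh
# Succeed if a two- or three-letter code is defined in the given JSON file.
# Assumption: entries look like '"alpha_2": "xy"' or '"alpha_3": "xyz"'.
code_in_file() {
    grep -qF -e "\"alpha_2\": \"$1\"" -e "\"alpha_3\": \"$1\"" "$2"
}

# Hypothetical policy helper: accept a code if ISO 639-2 or ISO 639-3
# defines it. Only these two datasets are consulted.
# $1 = code, $2 = dataset directory (defaults to the iso-codes location).
locale_code_known() {
    dir=${2:-/usr/share/iso-codes/json}
    code_in_file "$1" "$dir/iso_639-2.json" ||
        code_in_file "$1" "$dir/iso_639-3.json"
}

# Usage on a system with iso-codes installed:
#   locale_code_known ber && echo accepted
```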

Another hint was "Macrolanguages (Part 3)", which pointed to
https://en.wikipedia.org/wiki/ISO_639_macrolanguage

There it is cited:

  According to the ISO,

    Some existing code elements in ISO 639-2, and the corresponding
    code elements in ISO 639-1, are designated in those parts of ISO
    639 as individual language code elements, yet are in a one-to-many
    relationship with individual language code elements in [ISO
    639-3]. For purposes of [ISO 639-3], they are considered to be
    macrolanguage code elements.

    -- ISO 639-3: Relationship between ISO 639-3 and the other parts
       of ISO 639

And indeed, "ber" in ISO 639-2 is not the "Berber language" but "Berber
languages" (i.e. plural). In ISO 639-3 there's only "Judeo-Berber" with
the three-letter code "jbe". See also
https://en.wikipedia.org/wiki/Berber_languages and
https://en.wikipedia.org/wiki/Judeo-Berber_language for this specific
case.

From my point of view, these macrolanguages or language groups likely
shouldn't be used in locales, as according to
https://en.wikipedia.org/wiki/Locale_(computer_software), "a locale is
a set of parameters that defines the user's language, region and any
special variant preferences that the user wants to see in their user
interface". But I may be wrong…

It also says that "on POSIX platforms such as Unix, Linux and others,
locale identifiers are defined by ISO/IEC 15897".

The free draft of that standard,
https://www.open-std.org/jtc1/sc22/wg20/docs/n610.pdf, though only
refers to ISO 639, not to any of its parts. It also refers to "natural
languages", but that just separates the term "language" from artificial
and constructed languages.

Looking at the publishing date of ISO/IEC 15897 (according to
https://en.wikipedia.org/wiki/ISO/IEC_15897) and of the ISO 639 parts
(according to
https://en.wikipedia.org/wiki/ISO_639#Current_and_historical_parts_of_the_standard),
we can assume, though, that at least the original publication of ISO/IEC
15897 (1999) could only refer to ISO 639-1 ("1967 (as ISO 639)") and
639-2 (1998), as the later parts weren't published yet.

The second edition of ISO/IEC 15897 was published in 2011 and hence
could have been aware of ISO 639-3, which was first published in 2007,
as well as ISO 639-5, which was first published in 2008.

That "1967 (as ISO 639)" could also be a hint that only ISO 639-1 was
meant, but since ISO 639-1 only has two-letter codes it is clear that
this won't suffice and that at least some three-letter locales need to
be present. The question is still: Which ISO 639 parts should be used,
and which possibly not?

Looking again at
https://en.wikipedia.org/wiki/ISO_639#Current_and_historical_parts_of_the_standard
there is also the number of languages included in each standard listed:

  ISO 639-1 (two-letter only):   184
  ISO 639-2 (three-letter):      502
  ISO 639-3 (three-letter):     7893
  ISO 639-5 (three-letter):      115 (of which 29 are also in ISO 639-2)

It also seems that the union of ISO 639-3 and 639-5 is a superset of
ISO 639-2, but ISO 639-3 alone is not.

Conclusion
----------

So my current gut feeling and its reasoning is the following:

* The POSIX standard doesn't help us here (once again) because it is
  ambiguous (once again).

* ISO 639-1 and ISO 639-2 should be included because they're the most
  common ones, even though they seem to include some macrolanguages or
  language families.

* ISO 639-1 seems to be a subset of ISO 639-2 and iso-codes doesn't
  even include data files for ISO 639-1. So I consider just
  /usr/share/iso-codes/json/iso_639-2.json as the source for ISO 639-1
  as well as ISO 639-2.

* ISO 639-3 covers most languages but neither macrolanguages nor
  language families and hence should be included, too.

* ISO 639-5 only includes language families and groups and hence
  should _not_ be included.

If anyone has a different opinion on this topic, please speak up (and
preferably also explain why :-).

But actually there are only two other options which I consider to be
feasible:

* Keep ISO 639-3 as the only source for valid locales. (Which would
  make this issue a true positive.)

* Allow any (non-withdrawn) ISO 639 part as source for a valid locale
  name, i.e. use ISO 639-2, 639-3 and 639-5.

		Regards, Axel
-- 
 ,''`.  |  Axel Beckert <a...@debian.org>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE