Bug#1013946: lintian: wrongly report unknown-locale-code ber

Russ Allbery Mon, 27 Jun 2022 17:33:20 -0700

Axel Beckert <a...@debian.org> writes:

> Anyway, JFTR: I just looked at how lintian in Debian Stable (i.e.
> 2.104.0 in Bullseye) does the locale code lookup. It had its own data
> file for that (and hence now using iso-codes is good as it is no more
> duplicating these 33kB of data) and that file
> (/usr/share/lintian/data/files/locale-codes) states:


>   # List of locale codes.  This is derived from the ISO 639-1, ISO
>   # 639-2, and ISO 639-3 standards.

> And indeed, "ber" was in that file.

> So previously lintian did use ISO 639-1, 639-2 and 639-3.

> So using just ISO 639-3 was either an accident, on purpose or a
> regression and has been introduced when lintian was switching to
> iso-codes' files as data source in commit
> https://salsa.debian.org/lintian/lintian/-/commit/fcaded19

What I think I managed to reconstruct from reading about this [1] is that
639-2 was the original work to supplement 639-1 (which is limited to
two-letter codes and omits a lot of smaller languages).  However, ISO
639-2 also assigned codes to language families and some other things,
whereas ISO 639-3 is limited to just languages and the families were moved
to ISO 639-5.

[1] https://en.wikipedia.org/wiki/ISO_639-2 mostly.

Looking at ISO 639-5, I think a lot of those wouldn't make sense as
translations.  It has a lot of things like zhx (Chinese family), cpe (all
English-based creoles), or grk (Greek languages).  Some of those (cpe for
example) also appear in ISO 639-2, which implies to me that 639-2 is a bit
too broad for useful translations.

That said, reading more about the Berber languages [2], I understand how
this happened with this group in particular.  Specifically, this:

    A listing of the other Berber languages is complicated by their
    closeness; there is little distinction between language and
    dialect. The primary difficulty of subclassification, however, lies in
    the eastern Berber languages, where there is little agreement.

probably implies that the languages are sufficiently mutually
comprehensible that it may make sense to translate something to "Berber"
without specifying a specific language in the family.  (I could imagine
that sometimes it may avoid political and social issues to not specify a
specific language from the family, although I have no idea if that's the
case here.)

[2] https://en.wikipedia.org/wiki/Berber_languages

However, that wouldn't really make sense for "cpe" (creoles are very
different from each other even if they're English-based).  So that still
feels to me like it leans away from including everything in 639-2.

I think I may be talking myself into adding an exception list of non-639-3
language codes that nonetheless are used by translators.  But that's an
ongoing maintenance burden, so maybe that's not the right move either.

The alternate argument is that Lintian's check is really mostly there to
catch typos, and maybe we should assume anyone who uses any 639-2 or 639-3
code knows what they're doing.  And since that's what Lintian used to do,
it has the benefit of fixing a regression and I don't think anyone was
complaining about the breadth of the previous list, just the duplication
of information.
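
To make the two options concrete, here is a rough Python sketch.  It is
not Lintian's actual code (Lintian is written in Perl), and every name in
it is hypothetical.  The code sets are tiny hand-picked samples for
illustration; a real check would load the full tables, for example from
the iso-codes package.  Accepting two-letter 639-1 codes under both
policies is an assumption of the sketch.

```python
# Tiny illustrative samples only, NOT the full standards.
ISO_639_1 = {"en", "de", "zh"}
ISO_639_2 = {"eng", "deu", "ger", "zho", "chi", "kab", "ber", "cpe"}
ISO_639_3 = {"eng", "deu", "zho", "kab", "tzm"}

# Hypothetical exception list: non-639-3 codes that translators use anyway.
EXCEPTIONS = {"ber"}


def known_with_exceptions(code: str) -> bool:
    """Policy 1: 639-1 and 639-3, plus a hand-maintained exception list."""
    return code in ISO_639_1 or code in ISO_639_3 or code in EXCEPTIONS


def known_broad(code: str) -> bool:
    """Policy 2: accept any 639-1, 639-2, or 639-3 code (typo check only)."""
    return code in ISO_639_1 or code in ISO_639_2 or code in ISO_639_3


if __name__ == "__main__":
    for code in ("ber", "cpe", "tzm", "brr"):
        print(code, known_with_exceptions(code), known_broad(code))
```

Under both policies "ber" is accepted and a typo such as "brr" is still
flagged.  The difference is "cpe": the broad policy accepts it, while the
exception-list policy rejects it until someone adds it by hand.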

So in short, I think I talked myself back around to your solution.  :)
(Maybe all of this can be captured in comments for the next poor
maintainer who has to try to understand what's going on.)

-- 
Russ Allbery (r...@debian.org)              <https://www.eyrie.org/~eagle/>
