Re: Mandatory LC_ALL=C.UTF-8 during package building

Simon McVittie Fri, 07 Jun 2024 09:20:58 -0700

On Fri, 07 Jun 2024 at 14:32:14 +0200, Guillem Jover wrote:
> I'm a non-native speaker, who has been involved
> in l10n for a long time, while at the same time I've pretty much
> always run my systems with either LANG=C.UTF-8 or before that LANG=C,
> LC_CTYPE=ca_ES.UTF-8 and LC_COLLATE=ca_ES.UTF-8.


So diagnostic messages in your non-English language are so important to
you that you ... set your locale environment variables to values that
result in you seeing diagnostic messages in English instead? I'm not
sure I understand the point you're making here :-)

If your point is that people-who-are-not-you place a higher value on
having diagnostic messages come out in their non-English language than
you personally do, then, yes, that's certainly a valid thing for those
people to want.

But I'm not sure that our current package set actually achieves that -
increasingly many of our packages overwrite the locale with "C.UTF-8"
in some layer of their build system, because they cannot guarantee that
the locale they inherit from the environment is anything reasonable (in
particular, it might be "C", which often breaks tools that want to work
with non-ASCII filenames, inputs or outputs). In the enumeration from
my earlier message, you want (1.), but increasingly, what you actually
get is (2½.) instead, and that results in neither you nor Giole getting
the results you would hope for.

The compromise that Alexandre suggested elsewhere in the thread -
requiring the locale to be *something* UTF-8, but leaving it unspecified
exactly which UTF-8 locale, so that a French-speaking developer can ask
for fr_FR.UTF-8 and get their compiler warnings in French - seems like
something that might actually give you what you want in more cases than
the status quo does? If we mandate a UTF-8 locale, then stack layers like
debhelper's meson plugin could probably stop forcing C.UTF-8.

> we make lots of l10n work rather pointless

Surely only if that l10n work was done on tools that are only ever run
from package builds, and never interactively? A lot of localization is
done for end-user-facing tools (GUI, TUI or CLI) which are never relevant
during a package build anyway.

Even for compilers and similar non-interactive development tools, if
a French-speaking developer runs gcc in the fr_FR.UTF-8 locale during
their upstream development, they'll still benefit from its warnings being
localized into French, even if they would never see those same warnings
during a Debian package build of the same software.

(Analogous: I similarly benefit from gcc having ANSI colour highlights
in its output, even though my Debian package build logs don't have those.)

> and if no one is running with different locales then l10n
> bugs might easily creep in

If no one is running (their interactive sessions) with a particular
locale, why do we even support that locale?

If a locale has users, and they find bugs, then of course those bugs are
something to be fixed (subject to triaging and prioritization, because
we have more bugs than time). But I'm not convinced that occasionally
doing package builds in arbitrary locales is something that will find
locale bugs more readily than real users' normal use of the software
that we ship.

The locale issues I've generally seen during package builds are more like
"I've set up this artificial situation, and now the consequences of what
I asked for are considered to be a bug", for instance "if I run this
tool that wants to output UTF-8 in an ASCII-only locale, it fails with
an error message" (well, of course it does, it's being put in a situation
where it can't do its job as-designed). Or building HTML documentation in
an arbitrary locale, and then having reproducible-builds act surprised
that one build mentions the French translation of "table of contents"
and the other mentions the German translation of "table of contents"
(well, of course it does - "you asked for it, you got it").

> I can understand though the sentiment
> of wanting to shrug this problem category off and wanting instead to
> sweep it under the carpet, but that has accessibility consequences.

I am not advocating sweeping this problem category under the carpet!
I'm just not convinced that saying "we support building any package
with an arbitrary locale at entry to the build system" is actually a
good way to detect the sorts of locale issues that cause the sorts of
concrete end-user-facing problems that have accessibility consequences.

If we want to run test-suites under multiple locales, then we should
maybe consider doing that, rather than using the locale of the build
system as a proxy for the (single) locale in which tests will be run for
this particular build. Saying "it's a bug if your test suite fails in
tr_TR.UTF-8" doesn't do anything to guarantee that anyone will actually
ever try that particular build scenario.

And, even if your test suite passes in tr_TR.UTF-8, that doesn't
necessarily mean that the right thing as expected by a Turkish speaker
is actually happening - as a non-Turkish-speaker, I'm certainly not
confident that I could write a unit test for whether dotted vs. dotless
i are being handled correctly, or even identify which component would
benefit from that unit test.

    smcv

Re: Mandatory LC_ALL=C.UTF-8 during package building

Reply via email to