On Sun, Apr 12, 2026 at 7:46 PM Keith Thompson
<[email protected]> wrote:
> The issue for me is that some plain-text files are more difficult to
> read *in my environment* because they use Latin-1 rather than UTF-8
> encoding.

There's no consistency to the encodings of non-ASCII files in the
groff tree. In the top level alone, NEWS is in Latin-1 while LICENSES
and HACKING are in UTF-8. The historical ChangeLog.* files (that
aren't pure ASCII) are about evenly split between UTF-8 and Latin-1.
However, most of the 8-bit non-UTF-8 files have a "coding:" line
revealing their encoding, usually at the end of the file in an "Editor
settings" block (see, e.g., NEWS and doc/groff.texi.in). The
exceptions probably ought to be fixed:

    find . -type f -exec file {} \; | grep -F 8859 | cut -f1 -d: \
      | xargs grep -Fc coding: | grep ':0$'

> Most tools on my system (Ubuntu 24.04) are configured
> to use UTF-8 by default. I've configured other tools to do the same.

You should probably configure your tools to use the encoding of the
actual file being opened. The world seems to be converging on UTF-8,
but it's not there yet, and downloaded files can be in myriad
encodings. Your tools will serve you better if you don't tell them
everything is in UTF-8, because that's simply not true.
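
One way to do that from the shell, as a rough sketch: ask file(1) for
its guess at the encoding and let iconv(1) convert whenever the answer
isn't already UTF-8. The function name cat_utf8 is made up for
illustration, and file's answer is only a heuristic, so treat the
result with some suspicion for short or unusual files.

```shell
# Hypothetical helper: print a file as UTF-8, based on the encoding
# that file(1) guesses for it.  Assumes file(1) supports -b and
# --mime-encoding and that iconv(1) knows the name file reports.
cat_utf8() {
    enc=$(file -b --mime-encoding "$1") || return 1
    case $enc in
        utf-8|us-ascii)
            # Already valid UTF-8 (ASCII is a subset); pass through.
            cat < "$1"
            ;;
        binary|unknown-8bit)
            # file(1) could not name an encoding; refuse to guess.
            printf '%s: unrecognized encoding (%s)\n' "$1" "$enc" >&2
            return 1
            ;;
        *)
            # E.g. iso-8859-1 for the Latin-1 files discussed here.
            iconv -f "$enc" -t UTF-8 < "$1"
            ;;
    esac
}
```

Something like "cat_utf8 NEWS | less" would then display the Latin-1
NEWS file correctly in a UTF-8 terminal without changing the file.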
> The groff
> source files are somewhat unusual in using Latin-1 rather than UTF-8.

groff itself is somewhat unusual in that it can read ISO-8859 input
natively but requires preconv to handle UTF-8. (There is a long, slow
drive to change this: see http://savannah.gnu.org/bugs/?40720 and its
dependencies.) So there's some logic in limiting its text files to
Latin-1 where possible.
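
To illustrate the asymmetry, here is a minimal sketch, assuming a
groff installation that ships preconv. The wrapper name groff_any is
hypothetical; -K is groff's own option for handing an input encoding
to preconv (it implies -k, which runs preconv in the first place).

```shell
# Hypothetical wrapper: format a document whose encoding is given as
# the first argument; remaining arguments go to groff unchanged.
groff_any() {
    enc=$1
    shift
    case $enc in
        latin1|iso-8859-1|ISO-8859-1)
            # Latin-1 is read natively; no preprocessing needed.
            groff "$@"
            ;;
        *)
            # Anything else (e.g. utf-8) is routed through preconv.
            groff -K "$enc" "$@"
            ;;
    esac
}
```

For example, "groff_any utf-8 -Tutf8 -man foo.1" (foo.1 being a
placeholder file name) formats a UTF-8 man page, while
"groff_any latin1 -Tutf8 -man foo.1" skips preconv entirely.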
But even the "where possible" part isn't consistent. Of the
aforementioned UTF-8 top-level files, while LICENSES contains
characters unavailable in Latin-1, HACKING contains only characters
that *are* available in Latin-1.