Re: [Groff] man file character encoding.

Colin Watson Fri, 27 Sep 2013 02:19:13 -0700

On Thu, Sep 26, 2013 at 05:55:52PM -0400, Federico Lucifredi wrote:
> On Sep 26, 2013, at 4:32 AM, Colin Watson <cjwat...@debian.org> wrote:
> > Or use man-db instead, which is much smarter about character encodings
> > and handles your situation out of the box with no configuration
> > required; I just tested it to confirm this.
> 
> Since both pagers use troff in the back-end, this is just a matter of
> making the correct use of it.  Let me check if I am doing something
> funky upstream, with Werner's information it should be relatively
> simple.


Well, I did the work in man-db; starting from a relatively similar
baseline, it took me years to get its encoding handling to a point I
considered acceptable, and it was my main non-work project for much of
that time.  Admittedly most of that was before preconv existed
and before groff gained proper Unicode support upstream, which does
help, but only up to a point.

Of course I don't want to discourage you from improving man similarly,
but I don't think you should be under the false impression that it's a
simple problem when you're still at the point where users often have to
manually edit man.conf to make it do the right thing even for
single-encoding collections, and there's no support at all for
mixed-encoding collections, which is an important feature for
distributions.  man is still very much in a world where the user's
encoding is expected to match the encoding of all the manual pages they
want to read, which has not been a safe assumption for many years now.

One important element, although by no means the whole problem, is that
most manual pages do not have a coding tag, and a considerable amount of
automatic detection is necessary to do a good job in practice.
Obviously we can't autodetect all encodings, but with knowledge of the
language - which man generally has - you can usually reduce the likely
options to UTF-8 and a single legacy encoding for any given language,
and it is normally possible to tell the difference between those given a
large enough amount of text.  This is why I wrote manconv, and what much
of man-db/lib/encodings.c is for.  Even after you've done that, getting
exactly the right sequence of filters is quite a delicate matter.
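
To make the detection step above concrete, here is a minimal sketch in
Python.  It is not the man-db code: the function name guess_encoding, the
LEGACY_BY_LANGUAGE table and its entries, and the default fallback are
all assumptions chosen for illustration.  The idea is only that text
which decodes cleanly as UTF-8 is treated as UTF-8, and anything else is
assumed to be in the one legacy encoding associated with the page's
language.

```python
# Illustrative table only: one assumed legacy encoding per language code.
LEGACY_BY_LANGUAGE = {
    "de": "ISO-8859-1",
    "pl": "ISO-8859-2",
    "ru": "KOI8-R",
    "ja": "EUC-JP",
}


def guess_encoding(data: bytes, lang: str, default: str = "ISO-8859-1") -> str:
    """Guess whether a page is UTF-8 or the legacy encoding for its language.

    `lang` is a locale-style name such as "de" or "de_DE.UTF-8"; only the
    leading language code is used for the table lookup.
    """
    try:
        # Strict decoding: legacy 8-bit text with non-ASCII characters
        # rarely forms valid UTF-8 sequences by accident.
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        base = lang.split(".")[0].split("_")[0]
        return LEGACY_BY_LANGUAGE.get(base, default)


if __name__ == "__main__":
    print(guess_encoding("Größe".encode("utf-8"), "de"))
    print(guess_encoding("Größe".encode("iso-8859-1"), "de_DE"))
    print(guess_encoding("страница".encode("koi8-r"), "ru"))
```

A pure-ASCII page is reported as UTF-8 here, which is harmless since
ASCII is a subset of both candidates.  The check becomes more reliable
the more non-ASCII text the page contains, which is the "large enough
amount of text" caveat above.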

(And, if you're going to improve encoding handling, you should arguably
start by throwing catgets in the bin and switching to gettext, so that
your translated messages work properly.  catgets is just not fit for
purpose in a world with more than one possible encoding for a given
language, and this has been the source of real-world bugs in man that
have gone unaddressed for a long time.  Replacing catgets with gettext
was one of the first things I did when I took over man-db back in 2001.
According to https://bugs.gentoo.org/show_bug.cgi?id=93664 you've been
saying that you're going to address the message translation problem
since at least 2005; killing off catgets is the correct way to do that.)

Maybe it's time to consolidate efforts rather than spending what's
likely to be a lot of effort catching up.  I've tried to ensure that
man-db has all the important features of man, assisted particularly by
bug reports from Gentoo and Fedora users switching to man-db; are there
any places where you still feel it's objectively lacking, and if so
would you be willing to help me out upstream to fill the gaps?  The only
substantial missing feature I'm aware of is man2html, which I kind of
feel is better off as a separate package anyway; indeed "groff -Thtml"
or w3mman2html.cgi arguably do better jobs nowadays.

Cheers,

-- 
Colin Watson                                       [cjwat...@debian.org]
