On Thu, Sep 26, 2013 at 05:55:52PM -0400, Federico Lucifredi wrote:
> On Sep 26, 2013, at 4:32 AM, Colin Watson <cjwat...@debian.org> wrote:
> > Or use man-db instead, which is much smarter about character encodings
> > and handles your situation out of the box with no configuration
> > required; I just tested it to confirm this.
>
> Since both pagers use troff in the back-end, this is just a matter of
> making the correct use of it. Let me check if I am doing something
> funky upstream, with Werner's information it should be relatively
> simple.

Well, I did the work in man-db; starting from a relatively similar baseline it took me years to get it up to a point I considered acceptable encoding handling, and it was my main non-work project for much of that time. Admittedly most of that was before preconv existed and before groff gained proper Unicode support upstream, which does help, but only up to a point.

Of course I don't want to discourage you from improving man similarly, but I don't think you should be under the false impression that it's a simple problem when you're still at the point where users often have to manually edit man.conf to make it do the right thing even for single-encoding collections, and there's no support at all for mixed-encoding collections, which is an important feature for distributions. man is still very much in a world where the user's encoding is expected to match the encoding of all the manual pages they want to read, which has not been a safe assumption for many years now.

One important element, although by no means the whole problem, is that most manual pages do not have a coding tag, and a considerable amount of automatic detection is necessary to do a good job in practice. Obviously we can't autodetect all encodings, but with knowledge of the language - which man generally has - you can usually reduce the likely options to UTF-8 and a single legacy encoding for any given language, and it is normally possible to tell the difference between those given a large enough amount of text. This is why I wrote manconv, and what much of man-db/lib/encodings.c is for. Even after you've done that, getting exactly the right sequence of filters is quite a delicate matter.

(And, if you're going to improve encoding handling, you should arguably start by throwing catgets in the bin and switching to gettext, so that your translated messages work properly.
catgets is just not fit for purpose in a world with more than one possible encoding for a given language, and this has been the source of real-world bugs in man that have gone unaddressed for a long time. Replacing catgets with gettext was one of the first things I did when I took over man-db back in 2001. According to https://bugs.gentoo.org/show_bug.cgi?id=93664 you've been saying that you're going to address the message translation problem since at least 2005; killing off catgets is the correct way to do that.)

Maybe it's time to consolidate efforts rather than spending what's likely to be a lot of effort catching up. I've tried to ensure that man-db has all the important features of man, assisted particularly by bug reports from Gentoo and Fedora users switching to man-db; are there any places where you still feel it's objectively lacking, and if so would you be willing to help me out upstream to fill the gaps? The only substantial missing feature I'm aware of is man2html, which I kind of feel is better off as a separate package anyway; indeed "groff -Thtml" or w3mman2html.cgi arguably do better jobs nowadays.

Cheers,

--
Colin Watson [cjwat...@debian.org]
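
To make the detection idea described above a little more concrete (once the language is known, the candidates narrow to UTF-8 and one legacy encoding), here is a minimal sketch in Python. It is not manconv or the code in man-db/lib/encodings.c; the function name decode_page and the language-to-encoding table are hypothetical and chosen only for illustration.

```python
from typing import Tuple

# Hypothetical table: one legacy encoding per language. These pairings are
# illustrative assumptions, not the table man-db actually uses.
LEGACY_ENCODING = {
    "de": "iso-8859-1",
    "pl": "iso-8859-2",
    "ru": "koi8-r",
    "ja": "euc-jp",
}


def decode_page(raw: bytes, lang: str) -> Tuple[str, str]:
    """Return (text, encoding_used) for the raw bytes of a manual page."""
    try:
        # Legacy-encoded text containing non-ASCII bytes is rarely valid
        # UTF-8, so a strict decode that succeeds is good evidence of UTF-8.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Fall back to the single legacy candidate for this language
        # (ISO-8859-1 if the language is not in the table).
        legacy = LEGACY_ENCODING.get(lang, "iso-8859-1")
        # errors="replace" keeps the page readable if the guess is wrong.
        return raw.decode(legacy, errors="replace"), legacy


# A German page stored in ISO-8859-1 fails the UTF-8 check and falls back.
text, used = decode_page("Größe".encode("iso-8859-1"), "de")
```

The sketch only distinguishes "valid UTF-8" from "everything else". A short legacy-encoded fragment can happen to be valid UTF-8, which is why a reasonably large amount of text is needed for the check to be reliable, and pure-ASCII input is simply reported as UTF-8.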