Bug#519095: man-db: Improving man support for pages iso-8859-* encoded

Colin Watson Tue, 10 Mar 2009 16:43:00 -0700

clone 519095 -1
user man...@packages.debian.org
usertags 519095 target-2.5.5
tags 519095 fixed-upstream
reassign -1 manpages
retitle -1 manpages: state encoding of iso-8859-* pages
thanks

[Dear manpages maintainer: please read down for the part that affects
you.]

On Tue, Mar 10, 2009 at 01:16:18PM +0100, Hugo Herbelin wrote:
> My primary wish was to be able to correctly display the pages
> iso_8859-* and I end up with a suggestion for better supporting
> all pages encoded in one of the iso-8859-X coding systems.

So, this is really pretty complicated. I agree with almost all of your
analysis, but let me try to explain a bit further.

> Here were my successive experiences for displaying, e.g., the
> iso_8859-15 man page:
> 
> * Bad solutions *
> 
> - If I set my locale to utf8, I see all non-ascii characters in the
>   iso_8859-* pages as if they were iso-8859-1 characters. As reported
>   by "man -d", the information in the pipeline that is relevant to the
>   encoding is:
> 
>   manconv -f UTF-8:ISO-8859-1 -t ISO-8859-1//IGNORE | nroff -mandoc -Tutf8
> 
>   and indeed, nroff assumes having latin1 as default input and utf8 in
>   output.

Correct.

> - If I set my locale to iso885...@euro, I see "?" for the euro sign
>   and "1/4", "1/2" and "3/4" for the oe ligature and Y with
>   diaeresis. Indeed, the pipeline is
> 
>   manconv -f UTF-8:ISO-8859-1 -t ISO-8859-1//IGNORE | nroff -mandoc -Tlatin1 | iconv -c -f ISO-8859-1 -t ISO-8859-15//TRANSLIT
> 
>   which behaves as if the page were in ISO-8859-1 (while in fact it is in
>   ISO-8859-15) and translates what it thinks are ISO-8859-1 chars into
>   valid ISO-8859-15 sequences (the "¤" currency sign becomes "?"
>   because it has no equivalent and the "¼", "½", "¾" characters become
>   "1/4" and so on).

Correct. If man treated a file on the filesystem as being in a different
encoding just because you were using a different locale, that would be a
bug in itself; files don't change encoding just because you set an
environment variable.

That said, using the latin1 device and then recoding to ISO-8859-15 is
not really the best solution. I think it might be better to use the utf8
device and then recode to ISO-8859-15 from there. This doesn't entirely
fix the problem, though; see below.
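
As a rough illustration of the misreading described above (a sketch in
Python, not anything man-db runs), the four bytes below are the ones that
correspond to the characters mentioned in the report. Python's "replace"
error handler stands in for iconv here and does not transliterate, so it
yields "?" where //TRANSLIT would produce "1/4" and so on. The helper
name recode is made up for this example.

```python
# Illustration only: four bytes where ISO-8859-1 and ISO-8859-15 disagree.
raw = bytes([0xA4, 0xBC, 0xBD, 0xBE])

# What a pipeline that assumes ISO-8859-1 input believes it has read:
# currency sign, one quarter, one half, three quarters.
as_latin1 = raw.decode("iso-8859-1")

# What the page actually contains: euro sign, OE, oe, Y with diaeresis.
as_latin9 = raw.decode("iso-8859-15")


def recode(data: bytes, page_encoding: str, output_encoding: str) -> bytes:
    """Decode with the stated page encoding, go through Unicode, re-encode.

    Characters missing from the output encoding become "?" here; Python's
    "replace" handler does not transliterate the way iconv's //TRANSLIT does.
    """
    return data.decode(page_encoding).encode(output_encoding, errors="replace")


if __name__ == "__main__":
    # Wrong page encoding: none of the four characters exists in ISO-8859-15.
    print(recode(raw, "iso-8859-1", "iso-8859-15"))
    # Right page encoding: the bytes come back unchanged.
    print(recode(raw, "iso-8859-15", "iso-8859-15"))
```

The point is only that the damage happens at the first decode step: once
the bytes have been read as ISO-8859-1, no later recoding can recover the
euro sign.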

> * Better solutions *
> 
> In a second step, I tried to move the page iso_8859-* to a directory
> whose name tells what the encoding is (I typically move the
> iso_8859-15 page to a directory named "en.ISO8859-15/man7"). The pipeline
> seems to become better as we now obtain:

This is one approach, but a cleaner one would be to change the first
line of iso-8859-15.7.gz to:

  '\" t -*- coding: ISO-8859-15 -*-

(See manconv(1) for documentation of this.) You won't see evidence of
this in the debugging output, but it will cause manconv to ignore the
input encoding(s) given to it and assume ISO-8859-15 instead. There are
bugs in the current handling of this, though; see my comments below.
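
To make the mechanism concrete, here is a minimal sketch in Python of how
a tool could pick the encoding out of such a first line. It is not
manconv's code: the function name declared_encoding and the regular
expression are assumptions made for this example, and the real parser may
accept or reject forms that this one does not.

```python
import re
from typing import Optional

# The three characters that start a roff comment line of this kind: ' \ "
COMMENT_PREFIX = "'\\\""

# Hypothetical pattern: "-*- ... coding: NAME ... -*-" anywhere on the line.
CODING_RE = re.compile(r"-\*-.*?coding:\s*([^\s;]+).*?-\*-")


def declared_encoding(first_line: str) -> Optional[str]:
    """Return the encoding named on a '\\" ... -*- coding: X -*- line, if any."""
    if not first_line.startswith(COMMENT_PREFIX):
        return None
    match = CODING_RE.search(first_line)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(declared_encoding("'\\\" t -*- coding: ISO-8859-15 -*-"))
    print(declared_encoding(".TH EXAMPLE 7"))
```

A caller would use the returned name in place of whatever input
encodings it was given, and fall back to those only when the function
returns None.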

I've cloned this bug and reassigned the clone to manpages, since,
regardless of any other work done in this area, any English manual pages
that are not encoded in ISO-8859-1 or UTF-8 should state an explicit
encoding using the above mechanism.

> - with a utf8 locale:
> 
>   page_encoding = ISO-8859-15
>   source_encoding = ISO-8859-1
>   roff_encoding = ISO-8859-1
>   output_encoding = UTF-8
>   pipeline is: manconv -f UTF-8:ISO-8859-15 -t ISO-8859-1//IGNORE | nroff -mandoc -Tutf8
> 
> - with an iso885...@euro locale:
> 
>   page_encoding = ISO-8859-15
>   source_encoding = ISO-8859-1
>   roff_encoding = ISO-8859-1
>   output_encoding = ISO-8859-1
>   pipeline is: manconv -f UTF-8:ISO-8859-15 -t ISO-8859-1//IGNORE | nroff -mandoc -Tlatin1 | iconv -c -f ISO-8859-1 -t ISO-8859-15//TRANSLIT
> 
> What is better is that man has recognized that the encoding of the
> page is iso-8859-15 (based on the directory name), but it has failed
> to propagate this information when it turned to find an encoding that
> nroff supports. Something is strange there regarding the respective
> roles of the "source" and "page" encodings in the calls to manconv and
> roff.
> 
> From what I understand (but I'm uncertain), nroff does not support
> multibyte characters and hence, pages have to be converted to
> single-byte characters using the ascii8 device (it seems there is
> something special for east-asia languages but I don't understand well
> how it works). The problem seems to be that the single-byte encoding
> used to call nroff forgets about the encoding mentioned in the
> directory name and only keeps the language part of the directory name,
> then reassigning to each language a canonical default encoding. This
> strategy would be good for pages encoded in utf8: since nroff does not
> support utf8, we assume that, say, a Polish page in utf8 can always be
> converted to the single-byte iso-8859-2 encoding. But this strategy
> loses information when we already know that the page is encoded in a
> single-byte encoding.

I agree that the recoding from one legacy encoding to another loses
information, and this is definitely a bug.

It's important to remember that, with some exceptions, the current
version of groff in Debian cannot really be told to use a different
input encoding, which is where a lot of this weirdness comes from. It's
not just about single-byte vs. multibyte; with the exception of some
hacks for CJK (the nippon device), and the awful, awful ascii8 hack,
groff always assumes that its input is ISO-8859-1.

This has been fixed upstream by the introduction of the preconv
preprocessor, which will allow man to feed in any input encoding it
likes and have preconv convert it to a notation involving Unicode
codepoints that the groff core can understand. man-db is already
prepared to use this once it's available. However, there is one last
significant blocker to upgrading the Debian package, namely the
introduction of character class support so that the new groff can format
CJK text reasonably without the massive non-forward-portable Debian
patch. I'm working on this on and off at the moment.

Now, we can work around this somewhat by using the awful, awful hack I
mentioned above: the purpose of the ascii8 device is that its output
encoding is always the same as its input encoding (so far from
converting multibyte characters to single-byte characters, the ascii8
device exists to perform no conversion at all). This is typographically
unsound because groff is not supposed to just pass through character
data, but also to interpret it (e.g. hyphenation) and unless it knows
what characters are which it can't do its job properly. Nevertheless, in
the case of manual pages the consequences are not too bad, so this will
do as a workaround for the time being.

Using the ascii8 device for pages declared as ISO-8859-15 in their
preprocessor line breaks in man-db 2.5.4-1 for the following reasons:

  * manconv doesn't spot the -*- coding -*- line, because zsoelim puts a
    ".lf 1 -" line number marker before it. I've fixed this upstream by
    arranging for zsoelim to put the .lf request after any leading
    comment line.

  * The -*- coding -*- line is only read by manconv, not man itself.
    Thus, man recodes to ISO-8859-1 unnecessarily when it should realise
    that it needs to just use ISO-8859-15 all the way (with ascii8).
    I've fixed this upstream.
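
The first of these two problems can be reproduced in miniature. The
Python sketch below is not zsoelim or manconv code: all three function
names are invented, the sample page text is made up, and the exact .lf
request emitted after the fix (shown here as ".lf 2 -") is an assumption.
It only shows why a check limited to the first line stops working once a
marker is put in front of the comment.

```python
# The three characters that start a roff comment line of this kind: ' \ "
COMMENT_PREFIX = "'\\\""


def first_line_declares_coding(text: str) -> bool:
    """Mimic a reader that only ever inspects the first line of its input."""
    first = text.split("\n", 1)[0]
    return first.startswith(COMMENT_PREFIX) and "coding:" in first


def lf_before_everything(text: str) -> str:
    """Old behaviour as described above: the marker precedes the comment."""
    return ".lf 1 -\n" + text


def lf_after_leading_comment(text: str) -> str:
    """Fixed behaviour as described above: keep a leading comment line first.

    The ".lf 2 -" text is an assumption made for this sketch.
    """
    first, sep, rest = text.partition("\n")
    if first.startswith(COMMENT_PREFIX):
        return first + sep + ".lf 2 -\n" + rest
    return ".lf 1 -\n" + text


# Made-up sample page used only for this demonstration.
PAGE = "'\\\" t -*- coding: ISO-8859-15 -*-\n.TH EXAMPLE 7\n"

if __name__ == "__main__":
    print(first_line_declares_coding(lf_before_everything(PAGE)))
    print(first_line_declares_coding(lf_after_leading_comment(PAGE)))
```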

> My suggestions then are:
> 
> - Change the definition of "source encoding" so that if the language
>   directory name already mentions a single-byte encoding (say one of
>   the iso-8859-* encodings), it considers it to be the source encoding
>   and looks for a language-based canonical single-byte encoding (table
>   directory_table in file encodings.c of the man package) only if the
>   language directory name says it is a UTF-8 page.

I think that makes sense on the grounds that recoding between legacy
encodings tends to do more harm than good, and have implemented this.
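
The rule being agreed on here is small enough to sketch. The Python below
is an illustration under stated assumptions, not the code in
src/encodings.c or src/man.c: the function name choose_source_encoding,
the two-entry table, and the ISO-8859-1 fallback for unknown languages
are all invented for the example. The Polish entry follows the example
given in the report.

```python
# Hypothetical stand-in for a language -> legacy encoding table.
LEGACY_BY_LANGUAGE = {
    "en": "ISO-8859-1",
    "pl": "ISO-8859-2",
}


def choose_source_encoding(page_encoding: str, language: str) -> str:
    """Pick the encoding that the page is converted to before reaching roff.

    A page already in a legacy encoding is left in that encoding; only a
    UTF-8 page is mapped to a per-language legacy encoding.
    """
    if page_encoding.upper() != "UTF-8":
        return page_encoding
    # Assumed fallback for languages missing from the table.
    return LEGACY_BY_LANGUAGE.get(language, "ISO-8859-1")


if __name__ == "__main__":
    print(choose_source_encoding("ISO-8859-15", "en"))
    print(choose_source_encoding("UTF-8", "pl"))
```

With this rule an ISO-8859-15 page is never pushed through ISO-8859-1 on
the way to the formatter, which is the lossy step described earlier.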

> - Move the English-written pages using iso-8859-X encodings into
>   directories named en.ISO8859-X (this is about the manpages package).

As mentioned above, I think this is better done by way of a preprocessor
encoding declaration, assuming the fixes I've applied upstream and
intend to backport to Debian. That's neater than creating new
directories for a small number of pages.

Tue Mar 10 23:24:27 GMT 2009  Colin Watson  <cjwat...@debian.org>

        Fix handling of pages that declare a non-default encoding in their
        preprocessor lines. Thanks to Hugo Herbelin for some of the ideas
        here (Debian bug #519095).

        * src/encodings.c (get_source_encoding): Note that this function
          should only be called if the page encoding is UTF-8. Add another
          example.
        * src/manconv.c (check_preprocessor_encoding): Move to ...
        * src/encodings.c (check_preprocessor_encoding): ... here.
        * src/encodings.h (check_preprocessor_encoding): Add prototype.
        * src/man.c (make_roff_command): Use preprocessor-declared encoding
          as page_encoding if known. Set source_encoding to page_encoding
          unless the latter is UTF-8.
        * src/Makefile.am (manconv_SOURCES): Add encodings.c.

        * src/encodings.c (charset_table): Use ISO-8859-15 -> latin1 entry
          only in the !MULTIBYTE_GROFF case; true ISO-8859-15 pages are
          better handled using ascii8 or preconv if possible.

Tue Mar 10 14:11:14 GMT 2009  Colin Watson  <cjwat...@debian.org>

        * src/zsoelim.l (zsoelim_parse_file): Put the initial .lf request
          after any initial comment line, so that manconv can find encoding
          instructions more easily.

Thanks a lot,

-- 
Colin Watson                                       [cjwat...@debian.org]

