clone 519095 -1
user man...@packages.debian.org
usertags 519095 target-2.5.5
tags 519095 fixed-upstream
reassign -1 manpages
retitle -1 manpages: state encoding of iso-8859-* pages
thanks

[Dear manpages maintainer: please read down for the part that affects
you.]

On Tue, Mar 10, 2009 at 01:16:18PM +0100, Hugo Herbelin wrote:
> My primary wish was to be able to correctly display the pages
> iso_8859-* and I end up with a suggestion for better supporting
> all pages encoded in one of the iso-8859-X coding systems.

So, this is really pretty complicated. I agree with almost all of your
analysis, but let me try to explain a bit further.

> Here were my successive experiences for displaying, e.g., the
> iso_8859-15 man page:
>
> * Bad solutions *
>
> - If I set my locale to utf8, I see all non-ascii characters in the
> iso_8859-* pages as if they were iso-8859-1 characters. As reported
> by "man -d", the information in the pipeline that is relevant to the
> encoding is:
>
> manconv -f UTF-8:ISO-8859-1 -t ISO-8859-1//IGNORE | nroff -mandoc -Tutf8
>
> and indeed, nroff assumes having latin1 as default input and utf8 in
> output.

Correct.

> - If I set my locale to iso885...@euro, I see "?" for the euro sign
> and "1/4", "1/2" and "3/4" for the oe ligature and Y with
> diaeresis. Indeed, the pipeline is
>
> manconv -f UTF-8:ISO-8859-1 -t ISO-8859-1//IGNORE | nroff -mandoc -Tlatin1
> | iconv -c -f ISO-8859-1 -t ISO-8859-15//TRANSLIT
>
> which does as if the page were in ISO-8859-1 (while in fact it is in
> ISO-8859-15) and translate what it thinks are ISO-8859-1 chars into
> valid ISO-8859-15 sequences (the "¤" currency sign becomes "?"
> because it has no equivalent and the "¼", "½", "¾" characters become
> "1/4" and so on).

Correct. If man treated a file on the filesystem as being in a different
encoding just because you were using a different locale, that would be a
bug in itself; files don't change encoding just because you set an
environment variable.

That said, using the latin1 device and then recoding to ISO-8859-15 is
not really the best solution. I think it might be better to use the utf8
device and then recode to ISO-8859-15 from there.
This doesn't entirely fix the problem, though; see below.

> * Better solutions *
>
> In a second step, I tried to move the page iso_8859-* to a directory
> whose name tells what the encoding is (I typically move the
> iso_8859-15 page to a directory named "en.ISO8859-15/man7"). The pipeline
> seems to become better as we now obtain:

This is one approach, but a cleaner one would be to change the first
line of iso-8859-15.7.gz to:

  '\" t -*- coding: ISO-8859-15 -*-

(See manconv(1) for documentation of this.) Although you won't see
evidence of this in the debugging output, this will cause manconv to
ignore the input encoding(s) given to it and instead assume ISO-8859-15.
Although see my comments below about bugs in this ...

I've cloned this bug and reassigned the clone to manpages, since,
regardless of any other work done in this area, any English manual pages
that are not encoded in ISO-8859-1 or UTF-8 should state an explicit
encoding using the above mechanism.

> - with a utf8 locale:
>
> page_encoding = ISO-8859-15
> source_encoding = ISO-8859-1
> roff_encoding = ISO-8859-1
> output_encoding = UTF-8
> pipeline is: manconv -f UTF-8:ISO-8859-15 -t ISO-8859-1//IGNORE | nroff
> -mandoc -Tutf8
>
> - with an iso885...@euro locale:
>
> page_encoding = ISO-8859-15
> source_encoding = ISO-8859-1
> roff_encoding = ISO-8859-1
> output_encoding = ISO-8859-1
> pipeline is: manconv -f UTF-8:ISO-8859-15 -t ISO-8859-1//IGNORE | nroff
> -mandoc -Tlatin1 | iconv -c -f ISO-8859-1 -t ISO-8859-15//TRANSLIT
>
> What is better, is that man has recognized that the encoding of the
> page is iso-8859-15 (based on the directory name) but it has failed to
> to propagate this information when it turned to find an encoding that
> nroff supports. Something is strange there regarding the respective
> roles of the "source" and "page" encodings in the calls to manconv and
> roff.
>
> From what I understand (but I'm uncertain), nroff does not support
> multibyte characters and hence, pages have to be converted to
> single-byte characters using the ascii8 device (it seems there is
> something special for east-asia languages but I don't understand well
> how it works). The problem seems to be that the single-byte encoding
> used to call nroff forgets about the encoding mentioned in the
> directory name and only keeps the language part of the directory name,
> then reassigning to each language a canonical default encoding. This
> strategy would be good for pages encoded in utf8: since nroff does not
> support utf8, we assume that, say, a Polish page in utf8 can always be
> converted to the single-byte iso-8859-2 encoding. But this strategy
> losses information when we already know that the page is encoded in a
> single-byte encoding.

I agree that the recoding from one legacy encoding to another loses
information, and this is definitely a bug.

It's important to remember that, with some exceptions, the current
version of groff in Debian cannot really be told to use a different
input encoding, which is where a lot of this weirdness comes from. It's
not just about single-byte vs. multibyte; with the exception of some
hacks for CJK (the nippon device), and the awful, awful ascii8 hack,
groff always assumes that its input is ISO-8859-1.

This has been fixed upstream by the introduction of the preconv
preprocessor, which will allow man to feed in any input encoding it
likes and have preconv convert it to a notation involving Unicode
codepoints that the groff core can understand. man-db is already
prepared to use this once it's available. However, there is one last
significant blocker to upgrading the Debian package, namely the
introduction of character class support so that the new groff can format
CJK text reasonably without the massive non-forward-portable Debian
patch. I'm working on this on and off at the moment.
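
The information loss from recoding between legacy encodings, discussed
above, can be sketched in a few lines. This is only an illustration and
makes some assumptions: it uses Python's standard codecs rather than
man-db's iconv-based pipeline, and the function names are hypothetical.
It walks the eight byte values at which ISO-8859-15 and ISO-8859-1
differ, shows what each byte means in either encoding, and then shows
that none of the ISO-8859-15 characters survives a recode to ISO-8859-1.

```python
import unicodedata

# The eight byte values at which ISO-8859-15 and ISO-8859-1 differ.
DIFFERING_BYTES = bytes([0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE])


def read_as_latin9(raw: bytes) -> str:
    """Decode using the page's real encoding, ISO-8859-15."""
    return raw.decode("iso-8859-15")


def misread_as_latin1(raw: bytes) -> str:
    """Decode as ISO-8859-1, the input encoding groff assumes."""
    return raw.decode("iso-8859-1")


def recode_to_latin1(text: str) -> bytes:
    """Recode to ISO-8859-1, substituting '?' for anything unrepresentable."""
    return text.encode("iso-8859-1", errors="replace")


if __name__ == "__main__":
    for byte in DIFFERING_BYTES:
        raw = bytes([byte])
        intended = unicodedata.name(read_as_latin9(raw))
        misread = unicodedata.name(misread_as_latin1(raw))
        print(f"0x{byte:02X}: {intended} (read as Latin-1: {misread})")
    # None of the eight ISO-8859-15 characters exists in ISO-8859-1.
    print(recode_to_latin1(read_as_latin9(DIFFERING_BYTES)))
```

Every other byte value decodes identically in the two encodings, which
is why only a handful of characters (the euro sign, the oe ligatures and
so on) show up wrong when the page is treated as ISO-8859-1.
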
Now, we can work around this somewhat by using the awful, awful hack I
mentioned above: the purpose of the ascii8 device is that its output
encoding is always the same as its input encoding (so far from
converting multibyte characters to single-byte characters, the ascii8
device exists to perform no conversion at all). This is typographically
unsound because groff is not supposed to just pass through character
data, but also to interpret it (e.g. hyphenation) and unless it knows
what characters are which it can't do its job properly. Nevertheless, in
the case of manual pages the consequences are not too bad, so this will
do as a workaround for the time being.

Using the ascii8 device for pages declared as ISO-8859-15 in their
preprocessor line breaks in man-db 2.5.4-1 for the following reasons:

  * manconv doesn't spot the -*- coding -*- line, because zsoelim puts a
    ".lf 1 -" line number marker before it. I've fixed this upstream by
    arranging for zsoelim to put the .lf request after any leading
    comment line.

  * The -*- coding -*- line is only read by manconv, not man itself.
    Thus, man recodes to ISO-8859-1 unnecessarily when it should realise
    that it needs to just use ISO-8859-15 all the way (with ascii8).
    I've fixed this upstream.

> My suggestions then are:
>
> - Change the definition of "source encoding" so that if the language
> directory name already mentions a single-byte encoding (say one of
> the iso-8859-* encodings), it considers it to be the source encoding
> and looks for a language-based canonical single-byte encoding (table
> directory_table in file encodings.c of the man package) only if the
> language directory name tells it is an UTF-8 page.

I think that makes sense on the grounds that recoding between legacy
encodings tends to do more harm than good, and have implemented this.

> - Move the English-written pages using iso-8859-X encodings in
> directories named en.ISO8859-X (this is about the manpages package).

As mentioned above, I think this is better done by way of a preprocessor
encoding declaration, assuming the fixes I've applied upstream and
intend to backport to Debian. That's neater than creating new
directories for a small number of pages.

Tue Mar 10 23:24:27 GMT 2009  Colin Watson  <cjwat...@debian.org>

        Fix handling of pages that declare a non-default encoding in
        their preprocessor lines. Thanks to Hugo Herbelin for some of
        the ideas here (Debian bug #519095).

        * src/encodings.c (get_source_encoding): Note that this function
          should only be called if the page encoding is UTF-8. Add
          another example.
        * src/manconv.c (check_preprocessor_encoding): Move to ...
        * src/encodings.c (check_preprocessor_encoding): ... here.
        * src/encodings.h (check_preprocessor_encoding): Add prototype.
        * src/man.c (make_roff_command): Use preprocessor-declared
          encoding as page_encoding if known. Set source_encoding to
          page_encoding unless the latter is UTF-8.
        * src/Makefile.am (manconv_SOURCES): Add encodings.c.

        * src/encodings.c (charset_table): Use ISO-8859-15 -> latin1
          entry only in the !MULTIBYTE_GROFF case; true ISO-8859-15
          pages are better handled using ascii8 or preconv if possible.

Tue Mar 10 14:11:14 GMT 2009  Colin Watson  <cjwat...@debian.org>

        * src/zsoelim.l (zsoelim_parse_file): Put the initial .lf
          request after any initial comment line, so that manconv can
          find encoding instructions more easily.

Thanks a lot,

-- 
Colin Watson                                       [cjwat...@debian.org]