Package: man-db
Version: 2.5.4-1
Severity: wishlist
Tags: l10n

Hi,

My primary wish was to be able to display the iso_8859-* pages correctly, and I ended up with a suggestion for better support of all pages encoded in one of the iso-8859-X coding systems.

Here are my successive attempts at displaying, e.g., the iso_8859-15 man page:

* Bad solutions *

- If I set my locale to utf8, I see all non-ascii characters in the iso_8859-* pages as if they were iso-8859-1 characters. As reported by "man -d", the part of the pipeline that is relevant to the encoding is:

    manconv -f UTF-8:ISO-8859-1 -t ISO-8859-1//IGNORE | nroff -mandoc -Tutf8

  and indeed, nroff assumes latin1 as its default input and produces utf8 output.

- If I set my locale to iso885...@euro, I see "?" for the euro sign and "1/4", "1/2" and "3/4" for the oe ligatures (upper and lower case) and Y with diaeresis. Indeed, the pipeline is

    manconv -f UTF-8:ISO-8859-1 -t ISO-8859-1//IGNORE | nroff -mandoc -Tlatin1 | iconv -c -f ISO-8859-1 -t ISO-8859-15//TRANSLIT

  which behaves as if the page were in ISO-8859-1 (while in fact it is in ISO-8859-15) and translates what it thinks are ISO-8859-1 characters into valid ISO-8859-15 sequences (the "¤" currency sign becomes "?" because it has no equivalent, and the "¼", "½", "¾" characters become "1/4" and so on).

* Better solutions *

As a second step, I tried moving the iso_8859-* pages to a directory whose name states the encoding (typically, I move the iso_8859-15 page to a directory named "en.ISO8859-15/man7").

The pipeline seems to improve, as we now obtain:

- with a utf8 locale:

    page_encoding = ISO-8859-15
    source_encoding = ISO-8859-1
    roff_encoding = ISO-8859-1
    output_encoding = UTF-8

  pipeline is:

    manconv -f UTF-8:ISO-8859-15 -t ISO-8859-1//IGNORE | nroff -mandoc -Tutf8

- with an iso885...@euro locale:

    page_encoding = ISO-8859-15
    source_encoding = ISO-8859-1
    roff_encoding = ISO-8859-1
    output_encoding = ISO-8859-1

  pipeline is:

    manconv -f UTF-8:ISO-8859-15 -t ISO-8859-1//IGNORE | nroff -mandoc -Tlatin1 | iconv -c -f ISO-8859-1 -t ISO-8859-15//TRANSLIT

The improvement is that man now recognizes that the page is encoded in iso-8859-15 (based on the directory name), but it fails to propagate this information when it goes on to choose an encoding that nroff supports. Something is strange there regarding the respective roles of the "source" and "page" encodings in the calls to manconv and roff.

From what I understand (but I'm uncertain), nroff does not support multibyte characters and hence pages have to be converted to single-byte characters using the ascii8 device (there seems to be something special for East Asian languages, but I don't understand well how it works). The problem seems to be that the single-byte encoding chosen for the call to nroff ignores the encoding mentioned in the directory name and keeps only the language part of the directory name, then assigns to each language a canonical default encoding.

This strategy would be good for pages encoded in utf8: since nroff does not support utf8, we assume that, say, a Polish page in utf8 can always be converted to the single-byte iso-8859-2 encoding. But this strategy loses information when we already know that the page is encoded in a single-byte encoding.

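To illustrate the loss of information described above, here is a minimal Python sketch. It is not part of man-db, and the names page_bytes, intended, assumed, representable and lost are made up for the illustration. It takes the four bytes discussed in this report, decodes them once as ISO-8859-15 (what the page author meant) and once as ISO-8859-1 (what the pipeline assumes), and then lists which of the wrongly decoded characters have no place in ISO-8859-15, so that a final ISO-8859-1 to ISO-8859-15 conversion can only replace or transliterate them.

```python
import unicodedata

# Bytes 0xA4, 0xBC, 0xBD, 0xBE as they would appear in an ISO-8859-15 page.
page_bytes = bytes([0xA4, 0xBC, 0xBD, 0xBE])

# What the page author meant: the page really is ISO-8859-15.
intended = page_bytes.decode("iso-8859-15")

# What the pipeline assumes when it treats the page as ISO-8859-1.
assumed = page_bytes.decode("iso-8859-1")


def representable(ch, encoding):
    """Return True if the character ch can be encoded in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Characters that a conversion to ISO-8859-15 cannot keep as they are.
lost = [ch for ch in assumed if not representable(ch, "iso-8859-15")]

# Print Unicode names so the output is readable on any terminal.
for byte, good, bad in zip(page_bytes, intended, assumed):
    print(f"0x{byte:02X}: meant {unicodedata.name(good)}, "
          f"read as {unicodedata.name(bad)}")
print("not representable in ISO-8859-15:",
      [unicodedata.name(ch) for ch in lost])
```

All four wrongly decoded characters end up in the "not representable" list, which is consistent with the "?" and "1/4"-style substitutions seen in the iso885...@euro locale.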
My suggestions, then, are:

- Change the definition of "source encoding" so that, if the language directory name already mentions a single-byte encoding (say, one of the iso-8859-* encodings), that encoding is taken as the source encoding. The language-based canonical single-byte encoding (table directory_table in file encodings.c of the man package) would be looked up only if the language directory name says the page is in UTF-8.

- Move the pages written in English that use iso-8859-X encodings to directories named en.ISO8859-X (this concerns the manpages package).

Hoping I did not miss some other complex parts of the conversion process...

Hugo Herbelin

Remark: with the current version of man-db, there is still a (tedious) workaround to view the iso_8859-* pages correctly: to see a page in the iso-8859-X encoding, choose an iso88591 locale (e.g. fr_FR.iso88591) so that no translation happens, and display the result in a terminal that has first been set to treat its input as iso-8859-X text.

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.27-1-amd64 (SMP w/2 CPU cores)
Locale: LANG=fr_FR.utf-8, LC_CTYPE=fr_FR.utf-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages man-db depends on:
ii  bsdmainutils           6.1.10             collection of more utilities from
ii  debconf [debconf-2.0]  1.5.26             Debian configuration management sy
ii  dpkg                   1.14.25            Debian package management system
ii  groff-base             1.18.1.1-21        GNU troff text-formatting system (
ii  libc6                  2.9-4              GNU C Library: Shared libraries
ii  libgdbm3               1.8.3-4            GNU dbm database routines (runtime
ii  zlib1g                 1:1.2.3.3.dfsg-13  compression library - runtime

man-db recommends no packages.

Versions of packages man-db suggests:
ii  epiphany-gecko [www-browser  2.22.3-9      Intuitive GNOME web browser - Geck
ii  groff                        1.18.1.1-21   GNU troff text-formatting system
ii  iceweasel [www-browser]      3.0.7-1       lightweight web browser based on M
ii  less                         418-1         Pager program similar to more
ii  lynx-cur [www-browser]       2.8.7dev13-1  Text-mode WWW Browser with NLS sup
ii  w3m [www-browser]            0.5.2-2+b1    WWW browsable pager with excellent

-- debconf information excluded