Package: man-db
Version: 2.5.4-1
Severity: wishlist
Tags: l10n

Hi,

My primary wish was to be able to correctly display the pages
iso_8859-* and I end up with a suggestion for better supporting
all pages encoded in one of the iso-8859-X coding systems.

Here were my successive experiences for displaying, e.g., the
iso_8859-15 man page:

* Bad solutions *

- If I set my locale to utf8, I see all non-ascii characters in the
  iso_8859-* pages as if they were iso-8859-1 characters. As reported
  by "man -d", the information in the pipeline that is relevant to the
  encoding is:

  manconv -f UTF-8:ISO-8859-1 -t ISO-8859-1//IGNORE | nroff -mandoc -Tutf8

  and indeed, nroff assumes latin1 as its default input encoding and
  produces utf8 output.

- If I set my locale to iso885...@euro, I see "?" for the euro sign
  and "1/4", "1/2" and "3/4" for the oe ligature and Y with
  diaeresis. Indeed, the pipeline is

  manconv -f UTF-8:ISO-8859-1 -t ISO-8859-1//IGNORE | nroff -mandoc -Tlatin1 | iconv -c -f ISO-8859-1 -t ISO-8859-15//TRANSLIT

  which behaves as if the page were in ISO-8859-1 (while in fact it is
  in ISO-8859-15) and translates what it thinks are ISO-8859-1 chars
  into valid ISO-8859-15 sequences (the "¤" currency sign becomes "?"
  because it has no equivalent and the "¼", "½", "¾" characters become
  "1/4" and so on).
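
To make the misreading concrete, here is a small Python sketch (not part
of man-db; it only uses Python's standard codecs) that decodes the four
affected bytes under both encodings. The byte values are the positions
where ISO-8859-15 differs from ISO-8859-1 for these characters.

```python
# Four byte positions where ISO-8859-15 and ISO-8859-1 disagree.
raw = bytes([0xA4, 0xBC, 0xBD, 0xBE])

as_latin9 = raw.decode("iso-8859-15")  # what the page actually contains
as_latin1 = raw.decode("iso-8859-1")   # what the pipeline assumes

print(as_latin9)  # €ŒœŸ
print(as_latin1)  # ¤¼½¾

# None of the misread characters exists in ISO-8859-15, so a later
# conversion to ISO-8859-15 can only drop or transliterate them.
unencodable = []
for ch in as_latin1:
    try:
        ch.encode("iso-8859-15")
    except UnicodeEncodeError:
        unencodable.append(ch)

print(unencodable == list(as_latin1))  # True
```

This matches what I see on screen: the final iconv step has nothing to
map "¤" to, and //TRANSLIT spells out the fractions.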

* Better solutions *

As a second step, I tried moving the iso_8859-* pages to a directory
whose name states the encoding (for instance, I moved the
iso_8859-15 page to a directory named "en.ISO8859-15/man7"). The pipeline
gets somewhat better, as we now obtain:

- with a utf8 locale:

  page_encoding = ISO-8859-15
  source_encoding = ISO-8859-1
  roff_encoding = ISO-8859-1
  output_encoding = UTF-8
  pipeline is: manconv -f UTF-8:ISO-8859-15 -t ISO-8859-1//IGNORE | nroff -mandoc -Tutf8

- with an iso885...@euro locale:

  page_encoding = ISO-8859-15
  source_encoding = ISO-8859-1
  roff_encoding = ISO-8859-1
  output_encoding = ISO-8859-1
  pipeline is: manconv -f UTF-8:ISO-8859-15 -t ISO-8859-1//IGNORE | nroff -mandoc -Tlatin1 | iconv -c -f ISO-8859-1 -t ISO-8859-15//TRANSLIT

What is better is that man has recognized that the encoding of the
page is iso-8859-15 (based on the directory name), but it has failed
to propagate this information when choosing an encoding that
nroff supports. Something is strange there regarding the respective
roles of the "source" and "page" encodings in the calls to manconv and
nroff.

From what I understand (but I'm uncertain), nroff does not support
multibyte characters and hence, pages have to be converted to
single-byte characters using the ascii8 device (it seems there is
something special for East Asian languages but I don't understand well
how it works). The problem seems to be that the single-byte encoding
used to call nroff forgets about the encoding mentioned in the
directory name and only keeps the language part of the directory name,
then reassigning to each language a canonical default encoding. This
strategy would be good for pages encoded in utf8: since nroff does not
support utf8, we assume that, say, a Polish page in utf8 can always be
converted to the single-byte iso-8859-2 encoding. But this strategy
loses information when we already know that the page is encoded in a
single-byte encoding.

My suggestions then are:

- Change the definition of "source encoding" so that if the language
  directory name already mentions a single-byte encoding (say one of
  the iso-8859-* encodings), that encoding is taken as the source
  encoding, and the language-based canonical single-byte encoding (table
  directory_table in file encodings.c of the man package) is looked up
  only if the language directory name says it is a UTF-8 page.

- Move the English-written pages using iso-8859-X encodings into
  directories named en.ISO8859-X (this is about the manpages package).
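
The first suggestion could be sketched as follows. This is a hypothetical
Python illustration of the selection rule only, not man-db's actual C
code: the function names and the LANGUAGE_DEFAULTS table are made up for
the example, and the table is a tiny stand-in for directory_table with
only the two languages mentioned in this report.

```python
# Hypothetical stand-in for man-db's directory_table (illustrative subset).
LANGUAGE_DEFAULTS = {"en": "ISO-8859-1", "pl": "ISO-8859-2"}


def normalize_charset(name: str) -> str:
    """Turn directory spellings such as 'ISO8859-15' into 'ISO-8859-15'."""
    name = name.upper()
    if name.startswith("ISO8859-"):
        name = "ISO-8859-" + name[len("ISO8859-"):]
    return name


def proposed_source_encoding(lang_dir: str) -> str:
    """Pick the encoding to hand to nroff for a directory like 'en.ISO8859-15'."""
    locale, _, charset = lang_dir.partition(".")
    charset = normalize_charset(charset)
    # A single-byte encoding named in the directory is kept as-is.
    if charset.startswith("ISO-8859-"):
        return charset
    # Otherwise (UTF-8 or no charset) fall back to the per-language default.
    language = locale.split("_")[0]
    return LANGUAGE_DEFAULTS.get(language, "ISO-8859-1")


print(proposed_source_encoding("en.ISO8859-15"))  # ISO-8859-15
print(proposed_source_encoding("pl.UTF-8"))       # ISO-8859-2
```

With such a rule, the page in en.ISO8859-15 would reach nroff with its
real encoding instead of being recoded to ISO-8859-1 first.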

Hoping I did not miss some other complex parts of the conversion
process...

Hugo Herbelin

Remark: with the current version of man-db, there is still a
(tedious) workaround to see the iso_8859-* pages correctly: to see a
page with the iso-8859-X encoding, choose an iso88591 locale
(e.g. fr_FR.iso88591) so that no translation happens, and display the
result in a terminal that has first been set to treat its input as
iso-8859-X text.
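
Why this workaround works can be shown with a short Python sketch (an
illustration only, under the assumption that the pipeline does nothing
to these bytes beyond treating them as ISO-8859-1): reading bytes as
ISO-8859-1 and writing them back as ISO-8859-1 leaves them unchanged, so
a terminal that interprets them as ISO-8859-15 shows the intended
characters.

```python
# The same four ISO-8859-15 bytes as in the page source.
raw = bytes([0xA4, 0xBC, 0xBD, 0xBE])

# ISO-8859-1 in, ISO-8859-1 out: every byte maps to itself.
passed_through = raw.decode("iso-8859-1").encode("iso-8859-1")

# The terminal, told to expect ISO-8859-15, decodes the untouched bytes.
shown = passed_through.decode("iso-8859-15")

print(passed_through == raw)  # True
print(shown)                  # €ŒœŸ
```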

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.27-1-amd64 (SMP w/2 CPU cores)
Locale: LANG=fr_FR.utf-8, LC_CTYPE=fr_FR.utf-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages man-db depends on:
ii  bsdmainutils           6.1.10            collection of more utilities from 
ii  debconf [debconf-2.0]  1.5.26            Debian configuration management sy
ii  dpkg                   1.14.25           Debian package management system
ii  groff-base             1.18.1.1-21       GNU troff text-formatting system (
ii  libc6                  2.9-4             GNU C Library: Shared libraries
ii  libgdbm3               1.8.3-4           GNU dbm database routines (runtime
ii  zlib1g                 1:1.2.3.3.dfsg-13 compression library - runtime

man-db recommends no packages.

Versions of packages man-db suggests:
ii  epiphany-gecko [www-browser 2.22.3-9     Intuitive GNOME web browser - Geck
ii  groff                       1.18.1.1-21  GNU troff text-formatting system
ii  iceweasel [www-browser]     3.0.7-1      lightweight web browser based on M
ii  less                        418-1        Pager program similar to more
ii  lynx-cur [www-browser]      2.8.7dev13-1 Text-mode WWW Browser with NLS sup
ii  w3m [www-browser]           0.5.2-2+b1   WWW browsable pager with excellent

-- debconf information excluded


