Package: html2text
Version: 1.3.2a-6
Severity: minor

Hello,

As the system information below shows, I'm not using a UTF-8 locale.
Nevertheless, html2text produces UTF-8 text from UTF-8 HTML pages.
Conversely, on a UTF-8 system, html2text produces Latin-1 text from
Latin-1 HTML pages.  The recently added -utf8 option handles the case of
UTF-8 input on a UTF-8 system, but not the two cases above.

Generally speaking, there is no reason why the input and output charsets
should be related at all.  For the input, html2text should recognize
the meta http-equiv tag; that should work for a lot of pages, and for
the rest an input-charset option can be provided.  For the output, the
current locale's charset should be used (as returned by
nl_langinfo(CODESET) after calling setlocale(LC_CTYPE,"")); that should
work in almost all cases, and for the rest an output-charset option can
be provided.
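
To make the two lookups concrete, here is a minimal sketch in C.  It is
not html2text code: sniff_meta_charset() and locale_charset() are
hypothetical helper names, and the meta scan is deliberately crude (it
takes the first "charset=" it finds and does not check that it sits
inside a meta tag).

```c
#include <ctype.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical helper: copy the value following the first "charset=" in
 * html into out (NUL-terminated, at most outsize bytes).
 * Returns 0 if a charset name was found, -1 otherwise. */
static int sniff_meta_charset(const char *html, char *out, size_t outsize)
{
    static const char key[] = "charset=";
    const size_t klen = sizeof key - 1;

    if (outsize == 0)
        return -1;
    out[0] = '\0';

    for (const char *p = html; *p != '\0'; p++) {
        if (strncasecmp(p, key, klen) != 0)
            continue;
        p += klen;
        if (*p == '"' || *p == '\'')    /* tolerate a quoted value */
            p++;
        size_t n = 0;
        while (n + 1 < outsize &&
               (isalnum((unsigned char)p[n]) || p[n] == '-' ||
                p[n] == '_' || p[n] == '.' || p[n] == ':')) {
            out[n] = p[n];
            n++;
        }
        out[n] = '\0';
        return n > 0 ? 0 : -1;
    }
    return -1;
}

/* Hypothetical helper: charset of the user's locale, e.g. "ISO-8859-15".
 * The returned string belongs to libc and may be overwritten by later
 * setlocale() or nl_langinfo() calls; copy it if it must be kept. */
static const char *locale_charset(void)
{
    setlocale(LC_CTYPE, "");    /* adopt LC_ALL / LC_CTYPE / LANG */
    return nl_langinfo(CODESET);
}
```

An input-charset or output-charset option, if given, would simply
override the value returned by the corresponding helper.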

Yes, that means conversions.  But that's the way charsets are supposed
to be handled.  Note btw that for the conversions, one can just use
iconv_open(nl_langinfo(CODESET), page_charset), but one can also append
"//translit" to nl_langinfo(CODESET), so that iconv does the
transliteration itself, i.e. turns curly quotes and long dashes into
equivalents in the target charset.
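
A minimal sketch of such a conversion in C, assuming glibc or GNU
libiconv (the transliteration suffix is an extension of those
implementations, documented there as "//TRANSLIT", and is not part of
POSIX).  convert_charset() is a hypothetical name, not an html2text
function, and it converts a whole buffer in one call rather than
streaming.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: convert in[0..inlen) from charset `from` to
 * charset `to`, asking iconv to transliterate characters the target
 * cannot represent.  The result is written NUL-terminated into out
 * (at most outsize bytes).  Returns 0 on success, -1 on error. */
static int convert_charset(const char *from, const char *to,
                           const char *in, size_t inlen,
                           char *out, size_t outsize)
{
    char target[128];
    int n;

    if (outsize == 0)
        return -1;
    out[0] = '\0';

    /* Build e.g. "ISO-8859-15//TRANSLIT". */
    n = snprintf(target, sizeof target, "%s//TRANSLIT", to);
    if (n < 0 || (size_t)n >= sizeof target)
        return -1;

    iconv_t cd = iconv_open(target, from);
    if (cd == (iconv_t)-1)
        return -1;              /* unknown charset pair */

    char *inp = (char *)in;     /* iconv() does not modify the input */
    char *outp = out;
    size_t outleft = outsize - 1;   /* keep room for the final NUL */
    int rc = 0;

    if (iconv(cd, &inp, &inlen, &outp, &outleft) == (size_t)-1)
        rc = -1;                /* invalid input or output buffer full */
    else if (iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1)
        rc = -1;                /* could not flush the shift state */

    *outp = '\0';
    iconv_close(cd);
    return rc;
}
```

In html2text the `to` argument would be nl_langinfo(CODESET) and `from`
the charset declared by the page.  How a given character is
transliterated depends on the iconv implementation and possibly on the
locale.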

Samuel

-- System Information:
Debian Release: lenny/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'unstable'), (500, 'stable'), (1, 'experimental')
Architecture: i386 (i686)

Kernel: Linux 2.6.26
Locale: [EMAIL PROTECTED], [EMAIL PROTECTED] (charmap=ISO-8859-15)
Shell: /bin/sh linked to /bin/bash

Versions of packages html2text depends on:
ii  libc6                         2.7-13     GNU C Library: Shared libraries
ii  libgcc1                       1:4.3.1-2  GCC support library
ii  libstdc++6                    4.3.1-2    The GNU Standard C++ Library v3

html2text recommends no packages.

Versions of packages html2text suggests:
ii  curl                          7.18.2-5   Get a file from an HTTP, HTTPS or 
ii  wget                          1.11.4-1   retrieves files from the web

-- no debconf information


