Package: html2text
Version: 1.3.2a-6
Severity: minor

Hello,
As the information below shows, I'm not using a UTF-8 locale. Nevertheless, html2text produces UTF-8 text from UTF-8 HTML pages. Conversely, on a UTF-8 system, html2text produces latin1 text from latin1 HTML pages. The recently added -utf8 option handles the UTF-8-on-UTF-8 case, but not the two cases above.

Generally speaking, there is no reason why the input and output charsets should be related at all.

For the input, html2text should recognize the meta http-equiv tag; that should work for a lot of pages, and an input-charset option can be provided for the rest. For the output, the current locale's charset should be used (as returned by nl_langinfo(CODESET) after calling setlocale(LC_CTYPE, "")); that should work in almost all cases, and an output-charset option can be provided for the rest.

Yes, that means conversions. But that's the way charsets are supposed to be handled.

Note, by the way, that for the conversions one can just use iconv_open(nl_langinfo(CODESET), page_charset), but one can also append "//translit" to nl_langinfo(CODESET), so that iconv performs the transliterations itself, i.e. turns curly quotes and long dashes into equivalents in the target charset.

Samuel

-- System Information:
Debian Release: lenny/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'unstable'), (500, 'stable'), (1, 'experimental')
Architecture: i386 (i686)

Kernel: Linux 2.6.26
Locale: [EMAIL PROTECTED], [EMAIL PROTECTED] (charmap=ISO-8859-15)
Shell: /bin/sh linked to /bin/bash

Versions of packages html2text depends on:
ii  libc6       2.7-13     GNU C Library: Shared libraries
ii  libgcc1     1:4.3.1-2  GCC support library
ii  libstdc++6  4.3.1-2    The GNU Standard C++ Library v3

html2text recommends no packages.

Versions of packages html2text suggests:
ii  curl  7.18.2-5  Get a file from an HTTP, HTTPS or
ii  wget  1.11.4-1  retrieves files from the web

-- no debconf information
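
For reference, the iconv-based conversion proposed in the report could look roughly like the C sketch below. It is only an illustration, not html2text code: the helper names convert_charset and convert_to_locale, the fixed-size buffers, and the error handling are all assumptions made for the example. The "//TRANSLIT" suffix is an extension of GNU iconv and may not be available elsewhere.

```c
#include <assert.h>
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of html2text): convert the NUL-terminated
 * string `in` from charset `from` to charset `to`, storing a NUL-terminated
 * result in `out` (capacity `outsz` bytes).
 * Returns 0 on success, -1 on any failure; on failure the contents of
 * `out` are unspecified apart from out[0] having been reset. */
int convert_charset(const char *to, const char *from,
                    const char *in, char *out, size_t outsz)
{
    assert(to != NULL && from != NULL && in != NULL && out != NULL);
    if (outsz == 0)
        return -1;
    out[0] = '\0';

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -1;                      /* unknown charset pair */

    char *inp = (char *)in;             /* iconv() does not modify the input bytes */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsz - 1;         /* keep one byte for the terminator */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);   /* flush any shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;                      /* invalid input or output buffer too small */
    *outp = '\0';
    return 0;
}

/* Hypothetical helper: convert text in `page_charset` to the charset of the
 * current locale, asking iconv to transliterate characters that the target
 * charset cannot represent (GNU extension). */
int convert_to_locale(const char *page_charset,
                      const char *in, char *out, size_t outsz)
{
    char target[128];

    setlocale(LC_CTYPE, "");            /* adopt the user's locale */
    snprintf(target, sizeof target, "%s//TRANSLIT", nl_langinfo(CODESET));
    return convert_charset(target, page_charset, in, out, outsz);
}
```

With such helpers, a caller that has determined the page charset (from the meta http-equiv tag or an input-charset option) would pass each chunk of text through convert_to_locale before printing it. A real implementation would more likely open the conversion descriptor once and stream data through it rather than reopening it for every string.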

