package w3m
retitle 338264 Improve conversion from GB2312 to Big5 characters
thanks

On Wed, Nov 09, 2005 at 03:37:17AM +0800, Dan Jacobson wrote:
> Package: w3m
> Version: 0.5.1-4
> Severity: wishlist
> 
> w3m has big problems reading a file full of HTML entities.
> $ w3m http://ehp.niehs.nih.gov/cehp/docs/2005/113-1c/toc.html
> We see lots of "?". Firefox doesn't have any problems.
> 
> Even after
> $ wwwoffle -o http://ehp.niehs.nih.gov/cehp/docs/2005/113-1c/toc.html|\
> perl -pwe 'use HTML::Entities;$_=decode_entities($_);\
> s/gb2312/utf-8/'>file.html
> w3m has problems.
> 
> OK, I was finally able to prepare it for a big5 PDA:
> wwwoffle -o http://ehp.niehs.nih.gov/cehp/docs/2005/113-1c/toc.html|\
> perl -pwe 'use HTML::Entities;$_=decode_entities($_);\
> s/gb2312/big5/'|iconv -f utf-8 -t gb2312 -c|\
> iconv -f gb2312 -t big5 -c > file.html
> 
> We note the two iconv steps probably due to their incomplete mapping
> which I recall telling them. Also there is in fact no gb2312 in the original 
> file.
> -- System Information:
> Locale: LANG=zh_TW.Big5, LC_CTYPE=zh_TW.Big5 (charmap=BIG5)
> 
The problem is not the HTML entities but the conversion from GB2312 to
Big5 characters. The entities in the file specify characters which map
directly to characters in GB2312. But they do not map directly to Big5,
because the mapping from Simplified Chinese to Traditional Chinese is
ambiguous and can, for example, depend on the context.
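
To illustrate the ambiguity, here is a minimal Python sketch. The character
U+53D1 is only an example picked for this mail, not one taken from the page
in question, and the helper name encodable is made up for the sketch. It
relies on Python's built-in gb2312 and big5 codecs, which are not what w3m
uses internally, so it only shows the general effect: the simplified
character is encodable in GB2312 but has no code point of its own in Big5,
while both of its usual traditional counterparts do.

```python
# Example character chosen for illustration only (an assumption, not from the page).
SIMPLIFIED = "\u53d1"                            # simplified form
TRADITIONAL_CANDIDATES = ["\u767c", "\u9aee"]    # two traditional forms it can stand for


def encodable(text, codec):
    """Return True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # The simplified character should be representable in GB2312 ...
    print("gb2312:", encodable(SIMPLIFIED, "gb2312"))
    # ... but not in Big5, so a converter has to pick a traditional form.
    print("big5:  ", encodable(SIMPLIFIED, "big5"))
    # Both traditional candidates are representable in Big5.
    for candidate in TRADITIONAL_CANDIDATES:
        print("big5 candidate:", encodable(candidate, "big5"))
```

Which of the two candidates is correct depends on the word the character
appears in, so a plain character-by-character table cannot decide it.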

A workaround is to use either a GB2312 or a UTF-8 locale. With
LANG=zh_CN.GB2312 only two question marks were left, and with
LANG=zh_TW.UTF-8 all symbols were displayed.

Regards,
Karsten Schölzel
