package w3m
retitle 338264 Improve conversion from GB2312 to Big5 characters
thanks
On Wed, Nov 09, 2005 at 03:37:17AM +0800, Dan Jacobson wrote:
> Package: w3m
> Version: 0.5.1-4
> Severity: wishlist
>
> w3m has big problems reading a file full of HTML entities.
> $ w3m http://ehp.niehs.nih.gov/cehp/docs/2005/113-1c/toc.html
> We see lots of "?". Firefox doesn't have any problems.
>
> Even after
> $ wwwoffle -o http://ehp.niehs.nih.gov/cehp/docs/2005/113-1c/toc.html|\
> perl -pwe 'use HTML::Entities;$_=decode_entities($_);\
> s/gb2312/utf-8/'>file.html
> w3m has problems.
>
> OK, I was finally able to prepare it for a big5 PDA:
> wwwoffle -o http://ehp.niehs.nih.gov/cehp/docs/2005/113-1c/toc.html|\
> perl -pwe 'use HTML::Entities;$_=decode_entities($_);\
> s/gb2312/big5/'|iconv -f utf-8 -t gb2312 -c|\
> iconv -f gb2312 -t big5 -c > file.html
>
> We note the two iconv steps probably due to thier non complete mapping
> which I recall telling them. Also there is in fact no gb2312 in the original
> file.
> -- System Information:
> Locale: LANG=zh_TW.Big5, LC_CTYPE=zh_TW.Big5 (charmap=BIG5)
>

The problem is not the HTML entities but the conversion from GB2312 to
Big5 characters. The entities in the file specify characters that map
directly to characters in GB2312. They do not map directly to Big5,
however, because the mapping from Simplified Chinese to Traditional
Chinese is ambiguous and depends, for example, on the context.

A solution to this is to use either a GB2312 or a UTF-8 locale. With
LANG=zh_CN.GB2312 only two question marks were left, and with
LANG=zh_TW.UTF-8 all symbols were displayed.

Regards,
Karsten Schölzel
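
To illustrate the point about the missing direct mapping, here is a minimal Python sketch. It is not what w3m does internally, only a demonstration with Python's standard codecs. The helper name `unencodable` and the sample string are made up for this example. It checks which characters of a string have no representation in a given charset, and shows how a lossy conversion yields the "?" seen in the report.

```python
def unencodable(text, encoding):
    """Return the characters of `text` that `encoding` cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# Sample input (an assumption for this sketch): U+4E2D exists in both
# charsets, U+4EEC is a Simplified-only character.
sample = "\u4e2d\u4eec"

# Both characters are part of GB2312, so nothing is lost there.
gb_missing = unencodable(sample, "gb2312")

# Big5 has no code point for the Simplified-only character.
big5_missing = unencodable(sample, "big5")

# A lossy conversion substitutes "?" for what cannot be mapped.
lossy = sample.encode("big5", errors="replace").decode("big5")

print(gb_missing, big5_missing, lossy)
```

Passing errors="ignore" instead of errors="replace" drops the character silently, which corresponds roughly to what `iconv -c` does in the pipeline quoted above.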