My mistake. 'With IconvGNU' is a bug unrelated to Debian; you can ignore it.
But the ICU bug does exist. You must try '-x gbk', because '-x UTF-8' and
'-x GBK' take different code paths in xerces-c: the former is handled by
xerces-c itself, the latter by ICU.

Regards
Kirby Zhou

-----Original Message-----
From: Jay Berkenbilt [mailto:q...@debian.org]
Sent: Monday, August 09, 2010 8:16 AM
To: Kirby Zhou
Cc: 591...@bugs.debian.org
Subject: Re: Bug#591508: xerces-c CAN NOT deal with big MBCS-encoded file.

"Kirby Zhou" <kirbyz...@gmail.com> wrote:

> If a huge file is passed to XMLReader, it will call TransService multiple
> times and split the file content into several fragments.
> Unfortunately, a fragment will contain incomplete multi-byte characters,
> but neither ICUTransService nor IconvGNUTransService deals with that.
> ICUTransService does not handle U_TRUNCATED_CHAR_FOUND, and
> IconvGNUTransService does not handle EINVAL.
>
> 2.7.0, 2.8.0, 3.0.1, 3.1.1 have the same bug.

I'm afraid I'm not seeing the behavior you're describing.

> # compile the SAXPrint example of xerces-c.

Note that you can get SAXPrint by installing the libxerces-c-samples
package, which is what I have done to reproduce this.

> ]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for ((i=0;i<2;++i)); do echo -en '\xd6\xd0\xce\xc4\xba\xba\xd7\xd6A'; done ; echo; echo '</data>' ) > ~/small.xml
>
> ]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for ((i=0;i<100000;++i)); do echo -en '\xd6\xd0\xce\xc4\xba\xba\xd7\xd6A'; done ; echo; echo '</data>' ) > ~/big.xml
>
> # the small.xml and big.xml are analogous.

Okay so far.

> ]# samples/SAXPrint ~/small.xml
> <?xml version="1.0" encoding="LATIN1"?>
> <data>
> 中文汉字A中文汉字A
> </data>

I see this same thing. However, since your original file contains
characters that can't be represented in LATIN1, why not use something
like

samples/SAXPrint -X=UTF-8 ~/small.xml

For that, you get UTF-8 representations of these characters.

> # with icu
> ]# samples/SAXPrint ~/big.xml
> <?xml version="1.0" encoding="gbk"?>
> <data>
> Fatal Error at file /root/big.xml, line 3, char 16377
> Message: char 0x6C49 is not representable in 'gbk' encoding

When you say "with icu", I'm not exactly sure what you mean. Debian's
xerces packages are compiled with ICU, and there is no way, to my
knowledge, to avoid ICU without recompiling xerces-c. I'm not sure why
the first line above is

<?xml version="1.0" encoding="gbk"?>

It looks to me like maybe you're trying to read a file encoded one way
as if it were encoded another way. 0x6C49 appears to be the Unicode
value for the third character in your original small.xml.

> # with iconvgnu
> ]# samples/SAXPrint ~/big.xml
> <?xml version="1.0" encoding="LATIN1"?>
> <data>
> Fatal Error at file /root/big.xml, line 3, char 16377
> Message: invalid multi-byte sequence

Again, I'm not sure what you mean by "with iconvgnu". I'm able to use
SAXPrint from the debian libxerces-c-samples package to transcode the
xml files you've provided between GBK and UTF-8 without a problem, and
I'm not seeing the errors you've indicated.

Perhaps you can clarify a little what you're doing. If there's a bug
here, I'll report it to upstream. If it's just a question of
understanding the tool, perhaps I can help there too, or I can refer
you to the upstream mailing list for additional assistance.

Thanks for the report. I'll hold the bug open until we figure out where
the confusion is.

--
Jay Berkenbilt <q...@debian.org>
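
For reference, the fragment problem described in the report (a converter
being handed a buffer that ends in the middle of a multi-byte character)
can be illustrated outside xerces-c. The sketch below is only an analogy:
it uses Python's standard codecs module, not xerces-c, ICU, or iconv, and
the helper names (chunks, decode_naive, decode_incremental) and the chunk
size of 5 are made up for the illustration. It reuses the 9-byte GBK
sequence from the test files above. Decoding each fragment on its own
fails when a fragment ends on a lead byte, while a decoder that keeps the
trailing partial character and prepends it to the next fragment succeeds.

```python
import codecs

# The 9-byte GBK unit from the report: four 2-byte characters plus 'A'.
UNIT = b"\xd6\xd0\xce\xc4\xba\xba\xd7\xd6A"
data = UNIT * 4


def chunks(buf, size):
    """Split buf into fixed-size fragments, ignoring character boundaries."""
    return [buf[i:i + size] for i in range(0, len(buf), size)]


def decode_naive(buf, size):
    """Decode every fragment independently (no state kept between calls)."""
    return "".join(c.decode("gbk") for c in chunks(buf, size))


def decode_incremental(buf, size):
    """Decode fragment by fragment, carrying incomplete bytes forward."""
    decoder = codecs.getincrementaldecoder("gbk")()
    parts = chunks(buf, size)
    out = []
    for i, part in enumerate(parts):
        # final=True only on the last fragment, so a truncated character
        # in the middle of the stream is buffered instead of rejected.
        out.append(decoder.decode(part, final=(i == len(parts) - 1)))
    return "".join(out)


if __name__ == "__main__":
    try:
        decode_naive(data, 5)
    except UnicodeDecodeError as exc:
        print("naive decoding failed:", exc.reason)
    print(decode_incremental(data, 5) == data.decode("gbk"))
```

With a chunk size of 5, the first fragment ends on the lead byte 0xBA,
which is the situation the report attributes to U_TRUNCATED_CHAR_FOUND in
ICU and EINVAL in iconv. Whether xerces-c should handle it the same way
is a question for upstream.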