Bug#591508: xerces-c CAN NOT deal with big MBCS-encoded file.

Kirby Zhou Mon, 09 Aug 2010 06:03:19 -0700

My mistake. The 'with IconvGNU' case is a bug unrelated to Debian; you can ignore it.

But the ICU bug does actually exist.

You must try '-x gbk', because '-x UTF-8' and '-x GBK' take different code
paths in xerces-c: the former is handled by xerces-c itself, the latter by
ICU.

  Regards
  Kirby Zhou

-----Original Message-----
From: Jay Berkenbilt [mailto:q...@debian.org] 
Sent: Monday, August 09, 2010 8:16 AM
To: Kirby Zhou
Cc: 591...@bugs.debian.org
Subject: Re: Bug#591508: xerces-c CAN NOT deal with big MBCS-encoded file.

"Kirby Zhou" <kirbyz...@gmail.com> wrote:

> If a huge file is passed to XMLReader, it will call TransService multiple
> times and split the file content into several fragments.
> Unfortunately, a fragment may contain incomplete multi-byte characters.

> But neither ICUTransService nor IconvGNUTransService deals with this.
> ICUTransService does not handle U_TRUNCATED_CHAR_FOUND, and
> IconvGNUTransService does not handle EINVAL.
>
> 2.7.0, 2.8.0, 3.0.1, 3.1.1 have the same bug. 
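
The failure mode described in the quoted report can be illustrated outside
xerces-c. The following is a minimal sketch in Python, not xerces-c code: it
uses Python's built-in gbk codec as a stand-in for the transcoding service,
and the 16-byte chunk size and the helper names decode_chunks_naive and
decode_chunks_stateful are assumptions chosen only for illustration. It
contrasts decoding each fragment independently with decoding through a
stateful decoder that carries a trailing partial character into the next
fragment.

```python
import codecs

# Same nine-byte GBK pattern as in the report:
# four two-byte characters followed by ASCII 'A'.
PATTERN = b"\xd6\xd0\xce\xc4\xba\xba\xd7\xd6A"


def decode_chunks_naive(data: bytes, chunk_size: int) -> str:
    """Decode each fragment on its own; fails if a character straddles a boundary."""
    return "".join(
        data[i:i + chunk_size].decode("gbk")
        for i in range(0, len(data), chunk_size)
    )


def decode_chunks_stateful(data: bytes, chunk_size: int) -> str:
    """Decode fragment by fragment, carrying a trailing partial character forward."""
    decoder = codecs.getincrementaldecoder("gbk")()
    parts = []
    for i in range(0, len(data), chunk_size):
        parts.append(decoder.decode(data[i:i + chunk_size]))
    # final=True turns a dangling partial character at end of input into an error.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


data = PATTERN * 4  # 36 bytes; byte 15 is the lead byte of a two-byte character

try:
    decode_chunks_naive(data, 16)
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

print(naive_ok)                                               # False
print(decode_chunks_stateful(data, 16) == data.decode("gbk"))  # True
```

With a chunk size that happens to align with character boundaries (9 here),
both functions succeed, which would be consistent with a small input working
while a large one fails.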

I'm afraid I'm not seeing the behavior you're describing.

> # compile the SAXPrint example of xerces-c.

Note that you can get SAXPrint by installing the libxerces-c-samples
package, which is what I have done to reproduce this.

> ]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for
> ((i=0;i<2;++i)); do echo -en '\xd6\xd0\xce\xc4\xba\xba\xd7\xd6A'; done ;
> echo; echo '</data>' ) > ~/small.xml
>
> ]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for
> ((i=0;i<100000;++i)); do echo -en '\xd6\xd0\xce\xc4\xba\xba\xd7\xd6A';
> done; echo; echo '</data>' ) > ~/big.xml
>
> # small.xml and big.xml are analogous.

Okay so far.

> ]# samples/SAXPrint ~/small.xml 
> <?xml version="1.0" encoding="LATIN1"?>
> <data>
> &#x4e2D;&#x6587;&#x6C49;&#x5B57;A&#x4e2D;&#x6587;&#x6C49;&#x5B57;A
> </data>

I see this same thing.  However, since your original file contains
characters that can't be represented in LATIN1, why not use something
like

samples/SAXPrint -X=UTF-8 ~/small.xml

For that, you get UTF-8 representations of these characters.

> # with icu 
> ]# samples/SAXPrint ~/big.xml 
> <?xml version="1.0" encoding="gbk"?> 
> <data> 
> Fatal Error at file /root/big.xml, line 3, char 16377 
>   Message: char 0x6C49 is not representable in 'gbk' encoding

When you say "with icu", I'm not exactly sure what you mean.
Debian's xerces packages are compiled with ICU, and to my knowledge there
isn't any way to avoid ICU without recompiling xerces-c.

I'm not sure why the first line above is

<?xml version="1.0" encoding="gbk"?>

It looks to me like maybe you're trying to read a file encoded one way
as if it were encoded another way.  0x6C49 appears to be the Unicode
value for the third character in your original small.xml.
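
The code-point observation can be checked with a short sketch. It uses
Python's built-in gbk codec, which is an assumption standing in for ICU's
converter tables and may differ from them in corner cases. It decodes the
nine-byte pattern from the test files and lists the resulting code points,
then encodes U+6C49 back to GBK.

```python
# The nine-byte GBK pattern used to build small.xml and big.xml.
pattern = b"\xd6\xd0\xce\xc4\xba\xba\xd7\xd6A"

# Decode the bytes and list each character's Unicode code point.
text = pattern.decode("gbk")
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)  # ['U+4E2D', 'U+6587', 'U+6C49', 'U+5B57', 'U+0041']

# Reverse direction: in this codec, U+6C49 has a two-byte GBK encoding.
print("\u6c49".encode("gbk"))  # b'\xba\xba'
```

In this codec the third character is U+6C49 and it round-trips to the bytes
BA BA, so at least here the character itself is representable in GBK.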

> # with iconvgnu 
> ]# samples/SAXPrint ~/big.xml 
> <?xml version="1.0" encoding="LATIN1"?>
> <data>
> Fatal Error at file /root/big.xml, line 3, char 16377 
>   Message: invalid multi-byte sequence

Again, I'm not sure what you mean by "with iconvgnu".

I'm able to use SAXPrint from the Debian libxerces-c-samples package to
transcode the xml files you've provided between GBK and UTF-8 without a
problem, and I'm not seeing the errors you've indicated.

Perhaps you can clarify a little what you're doing.  If there's a bug
here, I'll report it to upstream.  If it's just a question of
understanding the tool, perhaps I can help there too, or I can refer you
to the upstream mailing list for additional assistance.

Thanks for the report.  I'll hold the bug open until we figure out where
the confusion is.

-- 
Jay Berkenbilt <q...@debian.org>

