[issue10149] Data truncation in expat parser

2010-10-19 Thread Maciek J

New submission from Maciek J :

Not sure if this is a Python problem or an expat problem, but I get truncated 
data while parsing XML documents.

This particular project parses an XML file from a Wikipedia dump.

The attached files are:
* xml-parse-revisions.py - parser script
* revision-test.xml - input XML
* revision-test.xml.sql - output SQL
* revision_create.sql - not really needed for this test case, but attached for 
completeness

Note that the output file sometimes contains truncated values for the 
"timestamp". Also note that if you add whitespace to the input XML, 
different timestamps are truncated.

My Python is 2.6.6.

--
components: XML
files: pyxml_error.zip
messages: 119184
nosy: Maciek.J
priority: normal
severity: normal
status: open
title: Data truncation in expat parser
versions: Python 2.6
Added file: http://bugs.python.org/file19292/pyxml_error.zip

___
Python tracker 
<http://bugs.python.org/issue10149>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10149] Data truncation in expat parser

2010-10-20 Thread Maciek J

Maciek J  added the comment:

Hm... It turns out that there is a "buffer_text" attribute:
http://docs.python.org/library/pyexpat.html#xml.parsers.expat.xmlparser.buffer_text
And setting this attribute to "True" seems to solve the problem.
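
For illustration, here is a minimal sketch of what setting the attribute looks 
like. The helper name parse_timestamps and the sample documents are assumptions 
made up for this example, not taken from the attached script. It only records 
the pieces that CharacterDataHandler receives inside a "timestamp" element:

```python
import xml.parsers.expat


def parse_timestamps(xml_bytes, buffer_text):
    """Return the character-data chunks reported inside <timestamp> elements.

    parse_timestamps is a hypothetical helper written for this sketch.
    """
    chunks = []
    state = {"inside": False}

    def start(name, attrs):
        state["inside"] = (name == "timestamp")

    def end(name):
        state["inside"] = False

    def chardata(data):
        # May be called several times for one run of text.
        if state["inside"]:
            chunks.append(data)

    parser = xml.parsers.expat.ParserCreate()
    # With buffer_text enabled, pyexpat coalesces contiguous character data
    # before calling CharacterDataHandler (up to the parser's buffer_size).
    parser.buffer_text = buffer_text
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chardata
    parser.Parse(xml_bytes, True)
    return chunks
```

With buffer_text set to True, a short value should arrive as a single piece. 
With it set to False, the number of pieces is up to the parser, so only the 
joined text can be relied on.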

It solves my problem, but the docs are still very confusing. I see two things 
that should be fixed:
1. The CharacterDataHandler description should explicitly note that data 
may be chunked even if it is short(!).
2. The description of the buffer_text attribute should note that data may 
also be arbitrarily chunked if it is set to False. My data _was_not_ chunked at 
newline characters (as the description suggests). It was chunked in the middle 
of a sentence (there was no whitespace in it!).
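
A handler that does not depend on buffer_text can be sketched as follows: it 
appends every piece it receives and joins them only when the element ends. The 
names TimestampCollector and collect_timestamps are hypothetical, chosen for 
this example. Feeding the document in small pieces is just a way to provoke 
splits; where the parser actually splits the text is not something to rely on.

```python
import xml.parsers.expat


class TimestampCollector(object):
    """Accumulates character data per <timestamp> element (hypothetical helper)."""

    def __init__(self):
        self.timestamps = []
        self._parts = None  # None means "not inside <timestamp>"

    def start(self, name, attrs):
        if name == "timestamp":
            self._parts = []

    def chardata(self, data):
        # Append instead of overwriting: one text run may arrive in pieces.
        if self._parts is not None:
            self._parts.append(data)

    def end(self, name):
        if name == "timestamp" and self._parts is not None:
            self.timestamps.append("".join(self._parts))
            self._parts = None


def collect_timestamps(pieces):
    """Feed the document piece by piece and return the complete timestamps."""
    collector = TimestampCollector()
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = collector.start
    parser.EndElementHandler = collector.end
    parser.CharacterDataHandler = collector.chardata
    for piece in pieces:
        parser.Parse(piece, False)
    parser.Parse(b"", True)  # signal end of input
    return collector.timestamps
```

A handler that simply stores the last piece it was given would produce exactly 
the kind of truncated values described in the original report.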

--




[issue10149] Data truncation in expat parser

2010-10-21 Thread Maciek J

Maciek J  added the comment:

I'm not familiar with the rst format, but I hope this works.

--
keywords: +patch
Added file: http://bugs.python.org/file19329/pyexpat.rst.patch




[issue10149] Data truncation in expat parser

2010-11-13 Thread Maciek J

Maciek J  added the comment:

I couldn't compile it to HTML at the moment, but it should be fine anyway.

Note that I didn't want to start a new paragraph (I'm guessing you meant the 
sentence at line 13 of the patch), as there was no new paragraph in the previous 
version.

--
Added file: http://bugs.python.org/file19599/pyexpat.rst.patch
