[issue10149] Data truncation in expat parser
New submission from Maciek J:

Not sure if this is a Python problem or an expat problem, but I get truncated data while parsing XML documents. This particular project parses an XML file from a Wikipedia dump.

The attached files are:
* xml-parse-revisions.py - parser script
* revision-test.xml - input XML
* revision-test.xml.sql - output XML
* revision_create.sql - not really needed for this test case, but attached for completeness

You can see that the output file sometimes contains values that are too short for the "timestamp". Also note that if you add whitespace to the input XML, different timestamps get truncated. My Python is 2.6.6.

--
components: XML
files: pyxml_error.zip
messages: 119184
nosy: Maciek.J
priority: normal
severity: normal
status: open
title: Data truncation in expat parser
versions: Python 2.6
Added file: http://bugs.python.org/file19292/pyxml_error.zip

___ Python tracker <http://bugs.python.org/issue10149> ___
___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10149] Data truncation in expat parser
Maciek J added the comment:

Hm... It turns out that there is a "buffer_text" attribute: http://docs.python.org/library/pyexpat.html#xml.parsers.expat.xmlparser.buffer_text

Setting this attribute to "True" seems to solve the problem. It solves my problem, but the docs are still very confusing. I see two things that should be fixed:

1. The CharacterDataHandler description should note explicitly that data may be chunked even if it is short(!).
2. The description of the buffer_text attribute should note that data may also be arbitrarily chunked if it is set to False.

My data _was_not_ chunked at newline characters (as the description suggests). It was chunked in the middle of a sentence (there was no whitespace in it!).
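The chunking described in this comment can be handled by collecting every piece that CharacterDataHandler delivers and joining them when the element ends. The following is only a minimal sketch written for Python 3 (the report used 2.6.6). The helper name extract_text, the sample document, and the split position are illustrative assumptions and are not taken from the attached script.

```python
import xml.parsers.expat


def extract_text(pieces, tag, buffer_text=False):
    """Return the full text of every <tag> element.

    `pieces` is an iterable of bytes objects that together form one XML
    document, e.g. blocks read from a large file. (Hypothetical helper.)
    """
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = buffer_text

    results = []   # one complete string per matching element
    chunks = None  # list of text pieces while inside <tag>, otherwise None

    def start(name, attrs):
        nonlocal chunks
        if name == tag:
            chunks = []

    def chars(data):
        # May be called several times for a single text node.
        if chunks is not None:
            chunks.append(data)

    def end(name):
        nonlocal chunks
        if name == tag and chunks is not None:
            results.append("".join(chunks))
            chunks = None

    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end

    for piece in pieces:
        parser.Parse(piece, False)
    parser.Parse(b"", True)  # signal the end of the document
    return results


# Made-up sample data, fed in two blocks that split the timestamp.
doc = (b"<page><revision><timestamp>2010-10-19T12:34:56Z"
       b"</timestamp></revision></page>")
print(extract_text([doc[:40], doc[40:]], "timestamp"))
```

Setting buffer_text to True can reduce the number of callbacks, but joining the pieces in the handlers does not depend on that setting.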
[issue10149] Data truncation in expat parser
Maciek J added the comment:

I'm not familiar with the rst format, but I hope this works.

--
keywords: +patch
Added file: http://bugs.python.org/file19329/pyexpat.rst.patch
[issue10149] Data truncation in expat parser
Maciek J added the comment:

I couldn't compile to HTML at the moment, but it should be fine anyway. Note that I didn't want to start a new paragraph (I'm guessing you meant the sentence at line 13 of the patch), as there was no new paragraph in the previous version.

--
Added file: http://bugs.python.org/file19599/pyexpat.rst.patch