[issue18268] ElementTree.fromstring non-deterministically gives unicode text data

2013-06-19 Thread Brendan O'Connor

New submission from Brendan O'Connor:

(This is Python 2.7 so I'm using string vs unicode terminology.)

When I use ElementTree.fromstring(), and use the .text field on nodes, the 
value is usually a string object, but in rare cases it's a unicode object.  I'm 
parsing many XML documents of newspaper text [1]; on one subset of the data, 
out of 5 million nodes, ~200 of them have a unicode object for the .text field.

I think this is all related to http://bugs.python.org/issue11033 but I can't 
figure out how, exactly.  I'm passing in strings to ElementTree.fromstring() 
like you're supposed to.

The workaround is to defensively convert the .text value to unicode [3].

[1] data is 
http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2012T21

[2] my processing code is 
https://github.com/brendano/gigaword_conversion/blob/master/annogw2justsent.py

[3]

def convert_to_unicode(mystr):
    if isinstance(mystr, unicode):
        return mystr
    if isinstance(mystr, str):
        return mystr.decode('utf8')
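
A slightly more defensive variant of the same workaround is sketched below. It is only an illustration: the name to_text and the choice to map a missing .text (None, as ElementTree gives for empty elements) to an empty string are assumptions, not something from the tracker. It is written so that it runs unchanged on Python 2.7 and 3.x.

```python
# Sketch: normalise an Element's .text to a text (unicode) object.
# `to_text` is a hypothetical helper name, not part of ElementTree.
text_type = type(u"")  # unicode on Python 2.7, str on Python 3


def to_text(value, encoding="utf-8"):
    # Empty elements have .text set to None; returning an empty
    # string here is an assumption made for convenience.
    if value is None:
        return text_type()
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):  # on 2.7, bytes is an alias of str
        return value.decode(encoding)
    raise TypeError("expected text or bytes, got %r" % type(value))
```

Raising on unexpected types, rather than silently returning None, makes a bad input visible at the call site.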

--
messages: 191496
nosy: Brendan.OConnor
priority: normal
severity: normal
status: open
title: ElementTree.fromstring non-deterministically gives unicode text data
type: behavior
versions: Python 2.7

___
Python tracker 
<http://bugs.python.org/issue18268>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18268] ElementTree.fromstring non-deterministically gives unicode text data

2013-06-19 Thread Brendan O'Connor

Brendan O'Connor added the comment:

By "non-deterministic" I just mean that the conversion happens for some data 
but not other data.  I should try to find examples that cause it to happen.
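
One way to hunt for such examples is sketched below. It counts the types seen in .text across a document and collects the nodes whose text contains non-ASCII characters. The working hypothesis, which is an assumption and has not been confirmed in this report, is that the unicode values are the ones with non-ASCII text. The name text_type_report is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def text_type_report(xml_bytes):
    """Count .text types in a document and collect non-ASCII text values."""
    root = ET.fromstring(xml_bytes)
    counts = Counter()
    non_ascii = []
    for node in root.iter():
        text = node.text
        if text is None:  # element has no text content
            continue
        counts[type(text).__name__] += 1
        # Any code point above 0x7F means the text is not pure ASCII.
        if any(ord(ch) > 127 for ch in text):
            non_ascii.append((node.tag, text))
    return counts, non_ascii
```

Comparing the type counts against the non_ascii list on the affected documents should show whether the two line up.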

--




[issue11033] ElementTree.fromstring doesn't work with Unicode

2013-08-04 Thread Brendan O'Connor

Brendan O'Connor added the comment:

Sure, go ahead and close it.  I was just trying to be helpful and report a bug 
in the Python standard library.  I don't use Python 3.3 so cannot test it.

--
nosy: +Brendan.OConnor

___
Python tracker 
<http://bugs.python.org/issue11033>
___