Hello,
I am trying to process an XML file that contains Unicode characters
(see http://vyakarnam.wordpress.com/). WordPress allows exporting the
entire content of the website into an XML file. Using
xml.dom.minidom, I wrote a few lines of Python code to parse the
XML file, but I am stuck with the following error:
>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
>>> titles = dom.getElementsByTagName("title")
>>> for title in titles:
...     print "childNode = ", title.childNodes
...
childNode = [<DOM Text node "Sanskrit N...">]
childNode = [<DOM Text node "Sanskrit N...">]
childNode = []
childNode = []
childNode = [<DOM Text node "1-1-1">]
childNode = Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-18: ordinal not in range(128)
>>>
The exception was raised when the loop reached the following node:
<title>अन् </title>
The XML declaration tells me that the document is UTF-8:
<?xml version="1.0" encoding="UTF-8"?>
I am running Python 2.5.1 on Mac OS X 10.5.6 and my locale settings are
as below:
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
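As far as I understand, the locale above and the codec named in the
error are separate settings: in Python 2 an implicit unicode-to-str
conversion uses sys.getdefaultencoding(), which is normally 'ascii'
whatever LANG says. A small sketch to see the three values side by
side (the variable names are only for illustration):

```python
import locale
import sys

# Codec used for implicit unicode <-> str conversions
# (normally 'ascii' on Python 2, independent of the locale).
default_codec = sys.getdefaultencoding()

# Codec suggested by the locale settings (LANG, LC_CTYPE, ...).
locale_codec = locale.getpreferredencoding()

# Codec attached to the output stream; may be None when output is piped.
stdout_codec = getattr(sys.stdout, "encoding", None)

sys.stdout.write("default=%s locale=%s stdout=%s\n"
                 % (default_codec, locale_codec, stdout_codec))
```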
I googled around for similar errors and tried wrapping the node list
in unicode(), but that fails in the same way:
>>> foo = unicode(titles[5].childNodes)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-18: ordinal not in range(128)
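Here is a self-contained sketch of the same situation that does not
need my export file. SAMPLE and title_texts are hypothetical names,
and the inline document only stands in for the real export. It
assumes the failure comes from displaying the node list (whose repr
contains the Devanagari text) rather than from parsing, which fits
the reported position 16-18, since `<DOM Text node "` is 16
characters long. The sketch reads the text from each node's .data
attribute and encodes it explicitly instead:

```python
import xml.dom.minidom

# Hypothetical stand-in for the WordPress export; only <title> matters here.
SAMPLE = u"""<?xml version="1.0" encoding="UTF-8"?>
<rss><channel>
<title>1-1-1</title>
<title></title>
<title>\u0905\u0928\u094d </title>
</channel></rss>""".encode("utf-8")


def title_texts(dom):
    """Return the text of every <title> as a unicode string ('' if empty)."""
    texts = []
    for title in dom.getElementsByTagName("title"):
        # Read the text from the nodes themselves rather than relying
        # on the repr of the node list.
        parts = [node.data for node in title.childNodes
                 if node.nodeType == node.TEXT_NODE]
        texts.append(u"".join(parts))
    return texts


dom = xml.dom.minidom.parseString(SAMPLE)
texts = title_texts(dom)

# Encode explicitly to UTF-8 bytes instead of letting Python fall back
# to the default 'ascii' codec when the text is written out.
encoded = [t.encode("utf-8") for t in texts]
```

With Python 2 on a UTF-8 terminal, `print encoded[-1]` should then
show the title itself, though I have only reasoned this through on
the inline sample and not on the full export.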
I'm a novice with Unicode, and am not sure how best to
handle the Unicode text I'm dealing with (Devanagari). Any
suggestions will be helpful.
Thanks