Parsing unicode (devanagari) text with xml.dom.minidom

rparimi Sat, 07 Mar 2009 17:25:45 -0800

Hello,

I am trying to process an XML file that contains Unicode characters
(see http://vyakarnam.wordpress.com/). WordPress allows exporting the
entire content of the website into an XML file. Using
xml.dom.minidom, I wrote a few lines of Python code to parse the
XML file, but am stuck with the following error:


>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
>>> titles = dom.getElementsByTagName("title")
>>> for title in titles:
...    print "childNode = ", title.childNodes
...
childNode =  [<DOM Text node "Sanskrit N...">]
childNode =  [<DOM Text node "Sanskrit N...">]
childNode =  []
childNode =  []
childNode =  [<DOM Text node "1-1-1">]
childNode =  Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)
>>>

The traceback is raised when the loop reaches the following node:
<title>अन् </title>

The xml header tells me that the document is UTF-8:
<?xml version="1.0" encoding="UTF-8"?>

I am running Python 2.5.1 on Mac OS X 10.5.6 and my locale settings are
as below:
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=


I googled around for similar errors and tried calling unicode() on the
node list, but that didn't help either:
>>> foo = unicode(titles[5].childNodes)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)
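
A minimal sketch of one possible way around this, assuming the failure
comes from printing the node list rather than from the parse itself.
Printing a list goes through each Text node's repr, and the prefix
'<DOM Text node "' is 16 characters long, so positions 16-18 would be
the three Devanagari code points. Reading each node's .data and
encoding it explicitly avoids the repr. The SAMPLE document and the
title_texts helper are hypothetical stand-ins, not part of the export
file or of minidom.

```python
# -*- coding: utf-8 -*-
import xml.dom.minidom

# Hypothetical stand-in for the WordPress export: UTF-8 bytes with
# one ASCII title, one empty title, and one Devanagari title.
SAMPLE = (u'<?xml version="1.0" encoding="UTF-8"?>'
          u'<rss><title>1-1-1</title><title></title>'
          u'<title>\u0905\u0928\u094d </title></rss>').encode("utf-8")


def title_texts(dom):
    """Return the text of every <title> element as a unicode string."""
    texts = []
    for title in dom.getElementsByTagName("title"):
        # Take .data from text and CDATA children instead of relying on
        # the repr of the node list.
        parts = [node.data for node in title.childNodes
                 if node.nodeType in (node.TEXT_NODE,
                                      node.CDATA_SECTION_NODE)]
        texts.append(u"".join(parts))
    return texts


dom = xml.dom.minidom.parseString(SAMPLE)
titles = title_texts(dom)

# Encode explicitly so that no implicit ASCII conversion takes place.
encoded = [t.encode("utf-8") for t in titles]
```

On Python 2 the encoded byte strings can be passed to print directly;
with a real export file, xml.dom.minidom.parse(filename) would take the
place of parseString(SAMPLE).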

I'm a novice with Unicode, and am not sure about how best to
handle the Unicode text I'm dealing with (Devanagari). Any
suggestions will be helpful.

Thanks
--
http://mail.python.org/mailman/listinfo/python-list
