On Thu, Jan 09, 2014 at 09:50:24AM +0100, Garry Bettle wrote:

> I'm trying to parse some XML and I'm struggling to reference elements that
> contain foreign characters.

I see from your use of print that you're using Python 2. That means that
strings '' are actually byte-strings, not text-strings. That makes it
really easy for mojibake to creep into your program.

Even though you define a coding line for your file (UTF-8, well done!),
that only affects how Python reads the source code, not how it runs the
code. So when you have this line:

    stock=product.getElementsByTagName('AntalPåLager')[0].firstChild.nodeValue

the tag name 'AntalPåLager' is a *byte* string, not the text that you
include in your file. Let's see what Python does with it in version 2.7.
This is what I get on my default system:

    py> s = 'AntalPåLager'
    py> print repr(s)
    'AntalP\xc3\xa5Lager'

You might get something different.

What are those two weird escaped bytes doing in there, instead of å?
They come about because the string s is treated as bytes rather than
characters. Python 2 tries really hard to hide this fact from you -- for
instance, it shows some bytes as ASCII characters A, n, t, a, etc. But
you can't escape from the fact that they're actually bytes; eventually
it will cause a problem, and here it is:

> Traceback (most recent call last):
>   File "C:\Python27\Testing Zizzi.py", line 16, in <module>
>
> stock=product.getElementsByTagName('AntalPÃ¥Lager')[0].firstChild.nodeValue
> IndexError: list index out of range

See the tag name printed in the error message? 'AntalPÃ¥Lager'. That is
a classic example of mojibake, caused by taking bytes written in one
encoding (say, UTF-8) and incorrectly interpreting them under another
encoding (say, Latin-1).

There is one right way, and one half-right way, to handle text in
Python 2. They are:

- The right way is to always use Unicode text instead of bytes. Instead
  of 'AntalPåLager', use the u prefix to get a Unicode string:

      u'AntalPåLager'

- The half-right way is to only use ASCII, and then you can get away
  with '' strings without the u prefix.
  Americans and English almost always can get away with this, so they
  often think that Unicode is a waste of time.

My advice is to change all the strings in your program from '' strings
to u'' strings, and see if the problem is fixed. But it may not be -- I'm
not an expert on XML processing, and it may turn out that minidom
complains about the use of Unicode strings. Try it and see.

I expect (but don't know for sure) that what is happening is that you
have an XML file with a tag AntalPåLager, but due to the mojibake
problem, Python is looking for a non-existent tag AntalPÃ¥Lager and
returning an empty list. When you try to index into that empty list,
you get the exception.


-- 
Steven
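
P.S. In case it helps to see the mojibake happen step by step, here is a
minimal sketch. Assumptions, none of them taken from your script: it is
written so that it also runs under Python 3 (where bytes and text are
separate types, which makes the mix-up explicit), and SAMPLE_XML, the
Produkt element and stock_values() are names I made up purely for
illustration. Treat it as a rough demonstration, not as a tested fix for
your program.

```python
# -*- coding: utf-8 -*-
from xml.dom import minidom

# The tag name as real text (a Unicode string).
TAG = u'AntalPåLager'

# The same name as UTF-8 bytes. This is what a '' literal holds in
# Python 2 when the source file is saved as UTF-8.
utf8_bytes = TAG.encode('utf-8')

# Mojibake: the UTF-8 bytes are decoded with the wrong codec (Latin-1),
# so the two bytes of "å" turn into two unrelated characters.
garbled = utf8_bytes.decode('latin-1')

# Hypothetical one-product document, stored as UTF-8 bytes.
SAMPLE_XML = u'<Produkt><AntalPåLager>7</AntalPåLager></Produkt>'.encode('utf-8')


def stock_values(xml_bytes, tag_name):
    """Return the text of every element called tag_name (hypothetical helper)."""
    doc = minidom.parseString(xml_bytes)
    return [node.firstChild.nodeValue
            for node in doc.getElementsByTagName(tag_name)]


found = stock_values(SAMPLE_XML, TAG)        # Unicode name matches the tag
missing = stock_values(SAMPLE_XML, garbled)  # garbled name matches nothing
```

Indexing missing[0] would raise the same "IndexError: list index out of
range" that appears in the traceback, because the lookup with the
garbled name comes back as an empty list.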