On 1/27/11, Steven D'Aprano <st...@pearwood.info> wrote:
> Alex Hall wrote:
>> Hello again:
>> I have never seen this message before. I am pulling xml from a site's
>> api and printing it, testing the wrapper I am writing for the api. I
>> have never seen this error until just now, in the twelfth result of my
>> search:
>> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
>> position 42: ordinal not in range(128)
>>
>> I tried making the strings Unicode by saying something like
>> self.title=unicode(data.find("title").text)
>> but the same error appeared. I found the manual chapter on this, but I
>> am not sure I want to ignore since I do not know what this character
>> (or others) might mean in the string. I am not clear on what 'replace'
>> will do. Any suggestions?
>
> Short version
> =============
>
> You need to decode the bytes you get from the XML into unicode
> characters. You would do this using something like:
>
>     unicode(data.find("title").text, encoding='utf-8')
>
> If that doesn't work, change utf-8 to another encoding. If the XML file
> tells you what the encoding should be, use that.
>
> Alternatively, you could say:
>
>     unicode(data.find("title").text, errors='replace')
>
> to substitute a "missing character" glyph for any undecodable bytes in
> the XML stream, or
>
>     unicode(data.find("title").text, errors='ignore')
>
> to just ignore them.

I tried both of those and got a different error. I have since fixed it,
so I no longer have the exact text, but it was something about not
supporting conversion from unicode. I finally ended up doing this:

    self.title=data.find("title").text.encode("utf-8")

and it seems happy enough, though I get odd characters above 128. I
suppose it is better than a traceback, and I suspect I just have the
wrong character set. Still, I found it very odd that
unicode(string, errors='replace') threw an exception.
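A possible explanation for that exception, as a hedged sketch rather than a diagnosis: the u'\u2019' in the original error suggests the parser had already returned decoded text, and asking Python to decode text a second time is a TypeError regardless of the errors= setting. The snippet below is written in Python 3 syntax (str and bytes; in Python 2 read unicode and str), and the title string is a made-up example, not data from the API.

```python
# Hypothetical title containing U+2019 (right single quotation mark).
# It is already decoded text, which is an assumption about what the
# XML parser in the thread returned.
title = "It\u2019s here"

# Decoding something that is already text fails, whatever errors= says.
try:
    str(title, errors="replace")
    decode_failed = False
except TypeError:
    decode_failed = True

# Encoding to ASCII reproduces the kind of error from the first message.
try:
    title.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# errors="replace" swaps the character for "?", while UTF-8 keeps it
# as three bytes.
replaced = title.encode("ascii", errors="replace")
utf8_bytes = title.encode("utf-8")

print(decode_failed, ascii_failed)
print(replaced)
print(utf8_bytes)
```

If this is what happened, the odd characters above 128 would be the UTF-8 bytes shown by a console that expects a different encoding, so the data itself may well be fine.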
>
>
> Long version
> ============
>
> You can't just say "turn these bytes into unicode" and expect it to
> magically work. Remember, in Python 2, so-called "strings" are actually
> strings of *bytes*, not characters. If you're a native English speaker,
> you've probably never needed to care about the distinction, but it is real.
>
> When you have a string "spam", what that *really* is is a sequence of
> bytes 73 70 61 6D (in hexadecimal). By convention, Python uses the ASCII
> encoding to map bytes to characters (e.g. hex 73 <=> "s"). That's not the
> only choice, but it has been the conventional choice for so long that
> people have forgotten that there are any other choices.
>
> The problem with ASCII is that it only knows how to deal with 128
> different bytes, and about 30 of those are invisible control characters.
> The other 128 bytes don't mean anything in ASCII, and you can run into
> problems trying to deal with them as text.
>
> There are hundreds of thousands of useful characters in the world, and
> only 128 ASCII ones. Prior to Unicode, people would choose their own
> preferred set of 256 useful characters, and semi-arbitrarily assign them
> to each of the 256 different bytes. Consequently there was a plethora of
> ad hoc encodings where a byte like (say) 0xC4 might represent (say)
>
>     'Ä' on Windows computers used in northern and western Europe
>     '─' on computers in Greece
>     'ƒ' on Macintosh computers in Western Europe
>     'ń' on Macintoshes in Eastern Europe
>
> and so forth. As you can imagine, exchanging files from one machine to
> another was a nightmare. This is where Unicode comes in -- in theory,
> there is a Unicode character for every useful character in any language
> anywhere, including mathematical symbols, dingbats, ancient dead
> languages, pictograms, and more.
>
> BUT files on disk, and in memory, are in bytes, not characters. You need
> some way to convert a character string into bytes, and back again.
> There are many different ways of doing so, depending on whether you care
> about making it as fast as possible, or as efficient as possible, or
> compatible with some pre-Unicode character set. And this is where the
> idea of encodings comes in. You can see a list of supported encodings here:
>
> http://docs.python.org/library/codecs.html#standard-encodings
>
> So the idea is, when you have a stream of bytes (say, from reading from
> a disk), you have to *decode* those bytes into Unicode text, and to
> write that text back again, you have to *encode* it to bytes.
>
> Now, Python tries to be very conservative: if you don't specify an
> encoding, it assumes you want ASCII, the lowest common denominator
> encoding that keeps English speakers happy. Lucky us. Until we have to
> deal with one or more bytes which can't be decoded into ASCII:
>
>     >>> "\xC4".decode('ascii')
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0:
>     ordinal not in range(128)
>
> Python isn't going to guess what character you want byte C4 to
> represent. We've already seen there are at least four different choices.
> You have to tell it which one you mean:
>
>     >>> print unicode("\xC4", encoding='macroman')
>     ƒ
>
>
> Must-read article:
> http://www.joelonsoftware.com/articles/Unicode.html
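The byte C4 example can be checked directly. What follows is a minimal sketch in Python 3 syntax (bytes and str; in Python 2 read str and unicode). The codec names cp1252, cp737, mac_roman and mac_latin2 are my guess at which standard codecs correspond to the four platforms listed above, not something stated in the explanation itself.

```python
# One byte, four legacy code pages, four different characters.
raw = b"\xc4"

# Assumed mapping from the platforms mentioned to Python codec names.
codecs_by_platform = {
    "Windows, western Europe": "cp1252",
    "DOS, Greece": "cp737",
    "Macintosh, western Europe": "mac_roman",
    "Macintosh, eastern Europe": "mac_latin2",
}

decoded = {}
for platform, codec in codecs_by_platform.items():
    decoded[codec] = raw.decode(codec)
    # Print the code point instead of the glyph so that the output does
    # not depend on the console's own encoding.
    print("%-26s %-10s U+%04X" % (platform, codec, ord(decoded[codec])))

# ASCII has no meaning for byte C4, so decoding with it fails.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Going the other way: the same character becomes different bytes
# depending on the encoding chosen.
a_umlaut_utf8 = "\u00c4".encode("utf-8")
a_umlaut_cp1252 = "\u00c4".encode("cp1252")
print(ascii_ok, a_umlaut_utf8, a_umlaut_cp1252)
```

The point of the round trip at the end is that neither bytes nor text are meaningful alone: every conversion between them has to name an encoding, either explicitly or through a default.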
A very interesting explanation! Thanks.

> --
> Steven
> _______________________________________________
> Tutor maillist - Tutor@python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor

--
Have a great day,
Alex (msg sent from GMail website)
mehg...@gmail.com; http://www.facebook.com/mehgcap