Thomas Armstrong wrote:
> I'm trying to parse a UTF-8 document with special characters like
> acute-accent vowels:
> --------
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> ...
> -------
>
> But I get this error message:
> -------
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 122: ordinal not in range(128)
> -------
> It works, but I don't want to substitute each special character, because there
> are always forgotten ones which can crack the program.
if you really want to use latin-1 in the database, and you don't mind losing
unsupported characters, you can use

    text_extrated = text_extrated.encode('iso-8859-1', 'replace')

which turns each unsupported character into a "?", or

    text_extrated = text_extrated.encode('iso-8859-1', 'ignore')

which silently drops them.
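
to see the difference between the two error handlers, here is a minimal
sketch. it assumes Python 3, where every str is unicode (in Python 2 the same
calls apply to unicode objects), and the sample text is made up for
illustration:

```python
# made-up sample: e-acute and u-acute exist in Latin-1, the en dash (U+2013) does not
text = "caf\u00e9 \u2013 men\u00fa"

# 'replace' substitutes "?" for every character Latin-1 cannot represent
replaced = text.encode("iso-8859-1", "replace")

# 'ignore' drops those characters without leaving a marker
ignored = text.encode("iso-8859-1", "ignore")

# the default handler ('strict') raises instead of losing data
try:
    text.encode("iso-8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(replaced)
print(ignored)
print(strict_failed)
```

either way the en dash is lost for good; only the accented vowels survive.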

a better approach is of course to convert your database to use UTF-8, which
can represent every unicode character, and use

    text_extrated = text_extrated.encode('utf-8')

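
a small sketch of why UTF-8 avoids the problem, again assuming Python 3 and a
made-up sample string: the encode step never fails, and decoding gives back
exactly what went in.

```python
# made-up sample containing the character from the error message (U+2013)
text = "caf\u00e9 \u2013 men\u00fa"

encoded = text.encode("utf-8")        # works for any unicode string
round_trip = encoded.decode("utf-8")  # lossless: same text comes back

print(encoded)
print(round_trip == text)
```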
it's also a good idea to switch to parameter substitution in your SQL queries,
passing the values as a single tuple and leaving the quoting to the driver:

    cursor.execute("update ... set text = %s where id = %s", (text_extrated, id))

it's possible that your database layer can automatically encode unicode strings
if you pass them in as parameters; see the database API documentation for
details.
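
as an illustration only, here is a self-contained sketch of parameter
substitution. it uses the standard sqlite3 module rather than your database,
so the placeholder is "?" instead of "%s", and the table name, columns, and
sample text are all hypothetical. it assumes Python 3, where sqlite3 accepts
unicode strings as parameters directly:

```python
import sqlite3

# in-memory database with a hypothetical table, just for the demonstration
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table docs (id integer primary key, text text)")
cur.execute("insert into docs (id, text) values (?, ?)", (1, ""))

text_extracted = "caf\u00e9 \u2013 men\u00fa"  # made-up sample text
doc_id = 1

# values travel separately from the SQL, as one tuple; no manual quoting
# or encoding of the string is needed
cur.execute("update docs set text = ? where id = ?", (text_extracted, doc_id))
conn.commit()

cur.execute("select text from docs where id = ?", (doc_id,))
stored = cur.fetchone()[0]
conn.close()

print(stored == text_extracted)
```

which placeholder style applies ("?", "%s", or a named form) depends on the
driver's paramstyle, so check that before copying the query.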
</F>