On 10 Jan 2006, at 20:35, Antonio Cavedoni wrote:
> Maybe we could start a “unicode” branch right after “magic-removal”
> is merged back into the trunk?
+1, sounds smart.
I've been bitten by unicode problems a bunch of times while using
Django and I'm not even trying to build an internationalized site.
For anyone who thinks this stuff isn't relevant to them, consider
this: if you write a site that does anything with data from RSS feeds
the chances are you will need to consume and process unicode of some
sort.
The Flickr Web Services API documentation says the following about
unicode:
"""
The Flickr API expects all data to be UTF-8 encoded.
Checks are made for valid UTF-8 sequences. If an invalid sequence is
found, the data is presumed to be ISO-8859-1 and converted
accordingly to UTF-8.
Sending data in any other encoding will result in garbage into
Flickr. It wont be dangerous garbage (we will always store valid
UTF-8) but it will still be garbage.
""" http://www.flickr.com/services/api/misc.encoding.html
This approach (assuming anything that is invalid UTF-8 is actually
ISO-8859-1 and converting it) appears to work extremely well. We
should do the same in Django, with one difference: first look at any
provided charset data and convert based on that, and only fall back
to assuming ISO-8859-1 if that fails.
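
To make the fallback order concrete, here is a rough sketch in Python.
It is only an illustration, not existing Django code: the function
name to_unicode and the declared_charset parameter are made up for
this example, and it assumes the input arrives as a byte string. It
tries the declared charset first, then UTF-8, and finally treats the
bytes as ISO-8859-1.

```python
def to_unicode(raw, declared_charset=None):
    """Decode a byte string to text using a Flickr-style fallback.

    Hypothetical helper for illustration only. Order of attempts:
    1. the charset supplied with the data, if any
    2. UTF-8
    3. ISO-8859-1 (every byte maps to a character, so this
       last step cannot raise)
    """
    candidates = []
    if declared_charset:
        candidates.append(declared_charset)
    candidates.append("utf-8")

    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            # LookupError: the declared charset name is unknown.
            continue

    # Last resort: presume ISO-8859-1, as the Flickr docs describe.
    return raw.decode("iso-8859-1")
```

One caveat with this sketch: if the declared charset is itself a
single-byte encoding such as ISO-8859-1, decoding will always
"succeed", so a wrong declaration still produces garbage (though, as
Flickr puts it, not dangerous garbage).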
Cheers,
Simon