On 10 Jan 2006, at 20:35, Antonio Cavedoni wrote:

Maybe we could start a “unicode” branch right after “magic-removal” is merged back into the trunk?

+1, sounds smart.

I've been bitten by Unicode problems a bunch of times while using Django, and I'm not even trying to build an internationalized site. For anyone who thinks this stuff isn't relevant to them, consider this: if you write a site that does anything with data from RSS feeds, chances are you will need to consume and process Unicode of some sort.

The Flickr Web Services API documentation states the following with regard to Unicode:

"""
The Flickr API expects all data to be UTF-8 encoded.

Checks are made for valid UTF-8 sequences. If an invalid sequence is found, the data is presumed to be ISO-8859-1 and converted accordingly to UTF-8.

Sending data in any other encoding will result in garbage into Flickr. It wont be dangerous garbage (we will always store valid UTF-8) but it will still be garbage.

""" http://www.flickr.com/services/api/misc.encoding.html

This approach (assuming anything that is invalid UTF-8 is actually ISO-8859-1 and converting it) appears to work extremely well. We should do the same for Django, with one addition: if charset information is provided with the data, we should try converting based on that first, and only fall back to assuming ISO-8859-1 if that fails.
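
To make the fallback order concrete, here is a rough sketch in Python. It is only an illustration of the idea, not a proposal for where this would live in Django; the function name to_text and its declared_charset parameter are made up for this example.

```python
def to_text(raw, declared_charset=None):
    """Decode a byte string to text using a Flickr-style fallback.

    Hypothetical helper: tries the declared charset (if any), then
    UTF-8, and finally assumes ISO-8859-1.
    """
    candidates = []
    if declared_charset:
        candidates.append(declared_charset)
    candidates.append("utf-8")

    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            # LookupError: the declared charset name is unknown to Python.
            continue

    # ISO-8859-1 maps every byte to a code point, so this cannot fail.
    # The result may still be garbage if the data was in some other encoding.
    return raw.decode("iso-8859-1")
```

With this, to_text(b"caf\xc3\xa9") and to_text(b"caf\xe9") both give the same string, since the second is not valid UTF-8 and gets treated as ISO-8859-1. As the Flickr docs point out, data in any other undeclared encoding will still come out as garbage, just not dangerous garbage.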

Cheers,

Simon

