>i understand that django's architecture should use unicode because it's
>the better way, but from the "outside"... what functionality is not
>working fine with non-english characters?

There are loads of things that don't work - actually anything that has
a notion of a character but gets fed a bytestring won't work. On the
surface these are things like .upper() and .lower() not working for
languages that do have case - like German, where the plain ASCII chars
are uppercased, but the umlauts 'äöü' aren't. Less visible things that
don't work include getting the correct length of a string - you get the
count of bytes, not the count of chars (although just switching to
unicode won't fix this without normalization of the unicode strings -
and even with normalization there are edge cases, because in Unicode
one char is not always one codepoint; with utf-8 bytestrings it's just
much worse). Other things that don't work are regexps with unicode char
classes, pulling out chars at a given index, or replacing single chars -
you would have to turn the string into unicode for any of those to work.
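
To make the points above concrete, here is a small sketch. It is an
assumption on my side that Python 3's bytes/str split is a fair stand-in
for the bytestring/unicode split discussed here; the variable names are
only for illustration.

```python
import re
import unicodedata

text = "\u00e4\u00f6\u00fc"       # the umlauts 'äöü' as a character string
raw = text.encode("utf-8")        # the same text as a utf-8 bytestring

# Case mapping: bytes only know ASCII, so the umlauts stay untouched.
upper_bytes = raw.upper()
upper_text = text.upper()

# Length: bytes count bytes, character strings count codepoints.
len_bytes = len(raw)
len_text = len(text)

# Even with unicode, one visible char is not always one codepoint:
# NFD splits 'ä' into 'a' plus a combining diaeresis, NFC joins it again.
decomposed = unicodedata.normalize("NFD", "\u00e4")
composed = unicodedata.normalize("NFC", decomposed)

# Regexps: \w in a bytes pattern only matches ASCII word chars.
word = "Gr\u00fc\u00dfe"          # 'Grüße'
words_bytes = re.findall(rb"\w+", word.encode("utf-8"))
words_text = re.findall(r"\w+", word)
```

With the bytestring, the umlauts come back unchanged from .upper(), the
length is the byte count, and the regexp splits the word at the first
non-ASCII byte. With the character string all three behave as expected.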

Other things that are problematic: utf-8 errors are raised in a rather
"lazy" fashion - not at the point where the error actually occurs,
like when reading the data, but only later, when some unicode
conversion happens. For example, bad data in your database would only
cause problems once you try to use the feed generator, as that one
seems to rely on unicode, at least partially.
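
A minimal sketch of that "lazy" failure, again using Python 3 bytes as a
stand-in. The store() function is hypothetical and only stands for any
layer that passes bytestrings through without looking at them; it is not
Django code.

```python
bad = b"caf\xe9"   # latin-1 encoded bytes, not valid utf-8


def store(value):
    # Hypothetical pass-through layer: it never decodes, so it never
    # notices that the data is broken.
    return value


stored = store(bad)   # no error here, the bad data travels on silently

# The error only shows up much later, at the first unicode conversion.
try:
    stored.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc
```

Decoding right where the data is read would move the exception to the
place where the bad data actually enters the system.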

And of course most standard Python libs just won't use their unicode
capabilities, because they never see unicode, but only get fed
bytestrings. I already mentioned the regexp lib and the string methods,
but there are many more things that can make good use of unicode, like
the whole XML stuff - currently you are on your own when it comes to
turning the unicode strings returned from those into utf-8 bytestrings,
because otherwise the Django core will get upset. By switching to full
unicode internally, we will work much better with the standard lib,
because the standard lib already prefers a full-unicode environment.
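
As a rough illustration of the XML case, assuming the standard
xml.etree.ElementTree module as the XML lib (the mail doesn't name a
specific one): the parser hands back character strings, and the manual
encode step is the extra work a bytestring-based core forces on you.

```python
import xml.etree.ElementTree as ET

# A small utf-8 encoded document containing 'Grüße'.
doc = "<title>Gr\u00fc\u00dfe</title>".encode("utf-8")

element = ET.fromstring(doc)
title = element.text                   # the parser returns a character string

# The manual step: encode back to utf-8 before handing the value to
# code that only accepts bytestrings.
title_bytes = title.encode("utf-8")
```

With a full-unicode core, title could be passed on as it is.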

bye, Georg
