Right, sorry -- I'm gonna have to go with Eric on that: there are built-in libraries that do just that (from unicodedata import normalize).
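
A minimal sketch of that idea, assuming Python 3 strings and a hypothetical helper named truncate_chars (not an existing Django filter): normalize to NFC first, then slice by code point. It uses the decomposed 'ü' case raised in this thread.

```python
from unicodedata import normalize


def truncate_chars(value, length):
    """Hypothetical helper: compose to NFC, then slice by code point."""
    return normalize("NFC", value)[:length]


# 'u' followed by U+0308 COMBINING DIAERESIS, then "ber" (decomposed form)
decomposed = "u\u0308ber"

# A naive slice keeps only the bare 'u' and drops the combining mark.
naive = decomposed[:1]

# Normalizing first yields the single composed code point U+00FC.
composed = truncate_chars(decomposed, 1)
```

Note that NFC only helps where a precomposed code point exists; sequences with no composed form would still be split by a plain slice, so this is a partial fix rather than full grapheme-aware truncation.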
J. Leclanche / Adys


On Thu, Dec 31, 2009 at 1:30 AM, James Bennett <ubernost...@gmail.com> wrote:

> On Wed, Dec 30, 2009 at 5:05 PM, Jerome Leclanche <adys...@gmail.com> wrote:
> > When truncating characters, we are obviously talking about truncating just
> > that: characters. Truncating bytes is a behaviour implemented by |slice.
>
> You misunderstand: I'm not talking about bytes, I'm talking about
> composed and decomposed characters.
>
> For example, 'ü' can be represented as either:
>
> 1. 00fc (LATIN SMALL LETTER U WITH DIAERESIS), or
>
> 2. 0075 (LATIN SMALL LETTER U) *followed by* 0308 (COMBINING DIAERESIS)
>
> Option 1 is composed, option 2 is decomposed and is actually *two
> Unicode characters*, not "two bytes", and so character-based slicing
> will chop off the combining diaeresis. The only way to avoid this is to
> have the filter do Unicode normalization to composed characters (e.g.,
> normalization form NFC or NFKC).
>
> --
> "Bureaucrat Conrad, you are technically correct -- the best kind of
> correct."
>
> --
> You received this message because you are subscribed to the Google Groups
> "Django developers" group.
> To post to this group, send email to django-develop...@googlegroups.com.
> To unsubscribe from this group, send email to
> django-developers+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/django-developers?hl=en.