What's the best way to implement Unicode slugification? My application expects user input in Chinese, with some English mixed in. How should I turn that Chinese-English mix into a URL slug that is SEO-friendly and still meaningful to humans?
On Sep 13, 10:01 pm, James Bennett <[email protected]> wrote:
> On Sun, Sep 13, 2009 at 8:25 PM, W.P. McNeill <[email protected]> wrote:
> > Is this expected behavior? I can see some discussion on the web that
> > references unicode support for slugification, but I can't tell if that
> > unicode support works for any arbitrary unicode characters, or Django
> > has hand-crafted slugification for certain non-ASCII characters (e.g.
> > common European characters).
>
> When in doubt, look at the source. The 'slugify' template filter is
> implemented as, well, a template filter, and so lives with all the
> other built-in filters in django.template.defaultfilters:
>
> http://code.djangoproject.com/browser/django/trunk/django/template/de...
>
> It's easy to see from the code what's going on. A Unicode string comes
> in to the filter, and is normalized (using form NFKD) and encoded as
> ASCII, ignoring non-convertible characters. Then any character which
> is neither a space, a hyphen nor an alphanumeric character is
> stripped, as is leading and trailing whitespace. Finally, spaces are
> replaced with hyphens.
>
> The result is something which will be usable in a URL, regardless of
> the exotic characters which went into it. However, this does have the
> possibility of discarding information, in a couple of places.
>
> First, the Unicode normalization and ASCII conversion is important --
> NFKD decomposes characters, and then the ASCII encode discards
> anything that can't be converted. So, for example, if the character
> 'ñ' is in the string, the NFKD normalization decomposes it into 'n'
> and a combining diacritic, and then the ASCII conversion with the
> 'ignore' flag discards the diacritic. For a URL, this is typically
> what you want, because it means 'ñ' becomes simply 'n'.
>
> The other place where you can lose characters is in discarding
> non-alphanumeric characters, but again for a URL this is typically
> what you want.
>
> --
> "Bureaucrat Conrad, you are technically correct -- the best kind of correct."
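
The steps described in the quoted message can be sketched in Python 3 roughly as follows. This is only an illustration written from that description, not Django's actual source: the function names `ascii_slugify` and `unicode_slugify` are hypothetical, lowercasing is left out because the description does not mention it, and collapsing runs of whitespace into a single hyphen is an assumption. The second function is one possible variant, again an assumption rather than anything from the thread, that skips the ASCII step so that Chinese characters survive.

```python
import re
import unicodedata


def ascii_slugify(value: str) -> str:
    """Hypothetical sketch of the steps described above (not Django's code)."""
    # NFKD splits 'ñ' into 'n' plus a combining tilde.
    value = unicodedata.normalize("NFKD", value)
    # Encoding to ASCII with 'ignore' drops whatever cannot be converted,
    # which includes the combining tilde and every Chinese character.
    value = value.encode("ascii", "ignore").decode("ascii")
    # Strip anything that is not alphanumeric, whitespace, or a hyphen,
    # then trim leading and trailing whitespace.
    value = re.sub(r"[^a-zA-Z0-9\s-]", "", value).strip()
    # Assumption: runs of whitespace collapse into a single hyphen.
    return re.sub(r"\s+", "-", value)


def unicode_slugify(value: str) -> str:
    """Hypothetical variant that keeps non-ASCII letters such as Chinese."""
    # NFKC folds compatibility forms (for example full-width Latin letters).
    value = unicodedata.normalize("NFKC", value)
    # In Python 3, \w matches Unicode word characters (and the underscore).
    value = re.sub(r"[^\w\s-]", "", value).strip()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", value)


if __name__ == "__main__":
    print(ascii_slugify("Año nuevo"))           # Ano-nuevo
    print(ascii_slugify("中文 Django 教程"))      # Django
    print(unicode_slugify("中文 Django 教程"))    # 中文-Django-教程
```

With the ASCII approach, a title written entirely in Chinese produces an empty slug, which is the information loss the quoted message warns about. A slug that keeps the Chinese characters has to be percent-encoded when it appears in a URL, so whether the variant is acceptable depends on how the application's URL routing treats non-ASCII paths.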

