What's the best way to implement Unicode slugification?

I have an application that expects Chinese characters (with some
English mixed in) in user inputs. What's the best way to turn that
Chinese-English mix into a slugified URL that's SEO-friendly as well
as meaningful to humans?

On Sep 13, 10:01 pm, James Bennett <[email protected]> wrote:
> On Sun, Sep 13, 2009 at 8:25 PM, W.P. McNeill <[email protected]> wrote:
> > Is this expected behavior?  I can see some discussion on the web that
> > references Unicode support for slugification, but I can't tell whether
> > that support works for arbitrary Unicode characters, or whether Django
> > has hand-crafted slugification for certain non-ASCII characters (e.g.
> > common European characters).
>
> When in doubt, look at the source. The 'slugify' template filter is
> implemented as, well, a template filter, and so lives with all the
> other built-in filters in django.template.defaultfilters:
>
> http://code.djangoproject.com/browser/django/trunk/django/template/de...
>
> It's easy to see from the code what's going on. A Unicode string comes
> in to the filter, and is normalized (using form NFKD) and encoded as
> ASCII, ignoring non-convertible characters. Then any character which
> is neither a space, a hyphen nor an alphanumeric character is
> stripped, as is leading and trailing whitespace. Finally, spaces are
> replaced with hyphens.
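
For anyone who wants to try those steps outside a template, here is a
rough sketch in plain Python 3. It follows the description above rather
than Django's actual source; the function name ascii_slugify is made up
for this example, and collapsing runs of whitespace into one hyphen is a
small liberty of the sketch.

```python
import re
import unicodedata


def ascii_slugify(value: str) -> str:
    """Approximate the steps described above (not Django's own code)."""
    # Step 1: decompose characters (NFKD), then drop anything non-ASCII.
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    # Step 2: strip everything that is not alphanumeric, whitespace or a
    # hyphen, then trim leading and trailing whitespace.
    value = re.sub(r"[^A-Za-z0-9\s-]", "", value).strip()
    # Step 3: replace whitespace with hyphens (runs collapse to one hyphen).
    return re.sub(r"\s+", "-", value)


print(ascii_slugify("Señor Niño's Café"))  # Senor-Ninos-Cafe
```
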
>
> The result is something which will be usable in a URL, regardless of
> the exotic characters which went into it. However, this does have the
> possibility of discarding information, in a couple of places.
>
> First, the Unicode normalization and ASCII conversion are important --
> NFKD decomposes characters, and then the ASCII encode discards
> anything that can't be converted. So, for example, if the character
> 'ñ' is in the string, the NFKD normalization decomposes it into 'n'
> and a combining diacritic, and then the ASCII conversion with the
> 'ignore' flag discards the diacritic. For a URL, this is typically
> what you want, because it means 'ñ' becomes simply 'n'.
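
The 'ñ' case can be checked directly with the standard library's
unicodedata module. This is only an illustration of the decomposition
described above, not code taken from Django.

```python
import unicodedata

decomposed = unicodedata.normalize("NFKD", "ñ")
# NFKD yields two code points: the base letter and a combining mark.
names = [unicodedata.name(ch) for ch in decomposed]
print(names)  # ['LATIN SMALL LETTER N', 'COMBINING TILDE']

# Encoding to ASCII with errors="ignore" silently drops the combining mark.
ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
print(ascii_only)  # n
```
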
>
> The other place where you can lose characters is in discarding
> non-alphanumeric characters, but again for a URL this is typically
> what you want.
>
> --
> "Bureaucrat Conrad, you are technically correct -- the best kind of correct."
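
Applied to the Chinese-English input from the question at the top, the
same normalize-and-encode step removes the Chinese entirely, because CJK
ideographs have no ASCII decomposition. The sketch below shows that, next
to one possible alternative that keeps Unicode word characters. The
function unicode_slugify is hypothetical (it is not the filter discussed
above), and the choice of NFKC and lowercasing are assumptions of this
example.

```python
import re
import unicodedata


def unicode_slugify(value: str) -> str:
    """Hypothetical variant that keeps non-ASCII letters in the slug."""
    # NFKC folds compatibility forms (such as full-width Latin letters)
    # without splitting characters into base letter + combining mark.
    value = unicodedata.normalize("NFKC", value)
    # In Python 3, \w on a str pattern matches Unicode letters and digits,
    # so CJK ideographs survive this step.
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", value)


text = "Django 教程 for beginners"

# The ASCII-only route: the two Chinese characters vanish.
ascii_only = (
    unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
)
print(repr(ascii_only))       # 'Django  for beginners'

# The Unicode-preserving route keeps them.
print(unicode_slugify(text))  # django-教程-for-beginners
```

A slug like this has to be percent-encoded when it goes into a URL, and
the URL pattern that matches it must accept non-ASCII characters.
Transliterating to pinyin instead is another option, but that needs a
lookup table or third-party library and is not shown here.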