[Python-Dev] urllib unicode handling

Tom Pinckney Tue, 06 May 2008 19:06:27 -0700

Hi,

While trying to use urllib in Python 2.5.1 to HTTP GET content from various web sites, I've run into a problem with urllib.quote (and .quote_plus): they don't accept unicode strings.


I see that this is an issue that has been discussed before:

        see this thread: http://mail.python.org/pipermail/python-dev/2006-July/067248.html
        especially this post: http://mail.python.org/pipermail/python-dev/2006-July/067335.html

While I don't really want to re-open a can of worms, it seems that the current implementation of urllib.quote and urllib.quote_plus is painfully incompatible with how the web (circa 2008) actually works. While the standards may say there is no official way to represent unicode strings in URLs, in practice the world uses UTF-8 quite heavily. For example, I found the following URLs in Google pretty quickly by looking for percent-encoded, UTF-8-encoded accented e's.


        http://www.last.fm/music/Jos%C3%A9+Gonz%C3%A1lez
        http://en.wikipedia.org/wiki/Joseph_Fouch%C3%A9
        http://apps.facebook.com/ilike/artist/Jos%C3%A9+Gonz%C3%A1lez/track/Stay+In+The+Shade?apv=1
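
To illustrate that these path segments really are percent-encoded UTF-8, here is a minimal sketch that decodes one of them. The helper name decode_segment is hypothetical, and the try/except import is only an assumption so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib import unquote_plus        # Python 2
except ImportError:
    from urllib.parse import unquote_plus  # Python 3


def decode_segment(segment):
    """Undo percent-encoding and interpret the result as UTF-8 text.

    Hypothetical helper for illustration only.
    """
    raw = unquote_plus(segment)
    # On Python 2 unquote_plus returns a byte string that still has to be
    # decoded; on Python 3 it already returns text decoded as UTF-8.
    if not isinstance(raw, type(u"")):
        raw = raw.decode("utf-8")
    return raw


artist = decode_segment("Jos%C3%A9+Gonz%C3%A1lez")
```

Here artist should come out as the unicode string u"Jos\xe9 Gonz\xe1lez": each %C3%A9 pair is the two-byte UTF-8 encoding of an accented e.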

While in theory UTF-8 is not a standard, sites like Last.fm, Facebook and Wikipedia seem to have embraced it (as have pretty much all other major web sites). As with HTML, there is what the standard says and what the actual browsers have to accept in order to work in the real world.

urllib.urlencode already converts unicode characters to their UTF-8 representation before percent encoding them. Why not urllib.quote and urllib.quote_plus?
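
As a rough sketch of the behavior being asked for (not how urllib is actually implemented), the unicode string can be encoded to UTF-8 by hand before it is quoted. The wrapper name quote_plus_utf8 is hypothetical, and the try/except import is only an assumption so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib import quote_plus        # Python 2
except ImportError:
    from urllib.parse import quote_plus  # Python 3


def quote_plus_utf8(text):
    """Encode a unicode string as UTF-8, then percent-encode the bytes.

    Hypothetical wrapper for illustration only.
    """
    return quote_plus(text.encode("utf-8"))


segment = quote_plus_utf8(u"Jos\xe9 Gonz\xe1lez")
```

With this wrapper, segment should match the path component seen in the Last.fm URL above, Jos%C3%A9+Gonz%C3%A1lez.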


Thanks for any thoughts on this,

Tom








