Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Tom Pinckney
I was assuming urllib.quote/unquote would only be called on text intended to be used in non-hostname portions of the URIs. I'm not sure if this is the actual intent of urllib.quote and perhaps the documentation should be updated to specify what precisely it does and then people can decide w

Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Martin v. Löwis
> If this is indeed the case, it sounds perfectly legal (according to the RFC) and perfectly practical (as required by numerous popular websites) to have urllib.quote and urllib.quote_plus do an automatic UTF-8 encoding of unicode strings before percent encoding them.
It's probably legal, bu
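To make the quoted proposal concrete, here is a minimal sketch of "UTF-8 encode, then percent-encode". It is written against Python 3, where the functions live in urllib.parse rather than in the urllib module discussed in this thread. The helper names quote_utf8 and quote_plus_utf8 are hypothetical and exist only for illustration.

```python
from urllib.parse import quote, quote_plus


def quote_utf8(text, safe="/"):
    """Encode text as UTF-8 bytes, then percent-encode those bytes.

    Hypothetical helper: it spells out the two steps the proposal asks
    urllib.quote to perform automatically for unicode input.
    """
    return quote(text.encode("utf-8"), safe=safe)


def quote_plus_utf8(text):
    """Same idea for form-style quoting, where spaces become '+'."""
    return quote_plus(text.encode("utf-8"))


if __name__ == "__main__":
    print(quote_utf8("Björk"))
    print(quote_plus_utf8("Sigur Rós"))
```

This only covers the non-hostname parts of a URI; hostnames need different treatment, as discussed elsewhere in the thread.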

Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Martin v. Löwis
> Maybe I didn't understand the RFC quite right, but it seemed like how to handle hostnames was left as a choice between IDNA encoding the hostname or replacing the non-ascii characters with dashes? I guess in practice IDNA is the right decision.
I haven't fully understood it, either, but I

Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Tom Pinckney
Maybe I didn't understand the RFC quite right, but it seemed like how to handle hostnames was left as a choice between IDNA encoding the hostname or replacing the non-ascii characters with dashes? I guess in practice IDNA is the right decision. Another part I wasn't clear on is whether urll
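As an illustration of the IDNA option mentioned above, the following sketch uses Python's built-in "idna" codec, which implements the older IDNA 2003 rules. It is written for Python 3, and host_to_ascii is a hypothetical helper name, not an existing urllib function.

```python
def host_to_ascii(host):
    """Convert a possibly non-ASCII hostname to its ASCII (IDNA) form.

    Each non-ASCII label is replaced by its "xn--" Punycode form, while
    labels that are already ASCII pass through unchanged.
    """
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    print(host_to_ascii("bücher.example"))
    print(host_to_ascii("python.org"))
```

Percent-encoding the hostname instead would yield a name that DNS cannot resolve, which is why the host part has to be handled separately from the rest of the URI.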

Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Robert Brewer
"Martin v. Löwis" wrote:
> The proper way to implement this would be IRIs (RFC 3987), in particular section 3.1. This is not as simple as just encoding it as UTF-8, as you might have to apply IDNA to the host part.
>
> Code doing so just hasn't been contributed yet.
But if someone wanted to

Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Tom Pinckney
I may be missing something, but it seems that RFC 3987 (which is about IRIs) basically says:
1) IRIs are identical to URIs except they may have unicode characters in them
2) IRIs must be converted to URIs before being used in HTTP
3) The way to convert IRIs to URIs is to UTF-8 encode the uni

Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Kristján Valur Jónsson
> -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf > Of Jeroen Ruigrok van der Werven > Sent: Wednesday, May 07, 2008 05:20 > To: Tom Pinckney > Cc: python-dev@python.org > Subject: Re: [Python-Dev] urllib unicode handling >

Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Armin Ronacher
Hi,
Jeroen Ruigrok van der Werven writes:
> Would people object if such functionality got added to urllib?
I would ;-) There are IRIs, just that nobody wrote a useful module for that. There are algorithms in the RFC that can convert URIs to IRIs and the other way round. IMO tha

Re: [Python-Dev] urllib unicode handling

2008-05-06 Thread Jeroen Ruigrok van der Werven
-On [20080507 04:06], Tom Pinckney ([EMAIL PROTECTED]) wrote:
>While in theory UTF-8 is not a standard, sites like Last.fm, Facebook and Wikipedia seem to have embraced it (as have pretty much all other major web sites). As with HTML, there is what the standard says and what the actual browse

Re: [Python-Dev] urllib unicode handling

2008-05-06 Thread Martin v. Löwis
> Thanks for any thoughts on this,
The proper way to implement this would be IRIs (RFC 3987), in particular section 3.1. This is not as simple as just encoding it as UTF-8, as you might have to apply IDNA to the host part. Code doing so just hasn't been contributed yet.
Regards, Martin
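A rough sketch of the kind of conversion described here, in Python 3: the host goes through IDNA, and the remaining components are UTF-8 encoded and percent-encoded. This is a simplified illustration rather than a complete RFC 3987 implementation. It ignores userinfo and IPv6 literals, and the function name iri_to_uri and the choice of safe characters are assumptions made for the example.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in path, query and fragment. '%' is kept so
# that escapes already present in the input are not encoded a second time.
# (Assumed set for this sketch, not taken from the thread.)
_SAFE = "/?%:@!$&'()*+,;="


def iri_to_uri(iri):
    """Convert an IRI to a URI (simplified; no userinfo, no IPv6 hosts)."""
    parts = urlsplit(iri)

    # Host part: apply IDNA instead of percent-encoding.
    netloc = ""
    if parts.hostname:
        netloc = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += ":%d" % parts.port

    # Everything else: quote() encodes str input as UTF-8 by default,
    # then percent-encodes every byte not listed as safe.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE)
    fragment = quote(parts.fragment, safe=_SAFE)

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(iri_to_uri("http://bücher.example/straße?q=ö"))
```

Splitting the IRI before encoding is what makes this more than "just encoding it as UTF-8": the host and the other components need different encodings.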