[Python-Dev] urllib unicode handling

2008-05-06 Thread Tom Pinckney

Hi,

While trying to use urllib in Python 2.5.1 to HTTP GET content from
various web sites, I've run into a problem with urllib.quote
(and urllib.quote_plus): they don't accept unicode strings.


I see that this is an issue that has been discussed before:

see this thread: 
http://mail.python.org/pipermail/python-dev/2006-July/067248.html
especially this post: 
http://mail.python.org/pipermail/python-dev/2006-July/067335.html

While I don't really want to re-open a can of worms, it seems that the
current implementation of urllib.quote and urllib.quote_plus is
painfully incompatible with how the web (circa 2008) actually works.
While the standards may say there is no official way to represent
unicode strings in URLs, in practice the world uses UTF-8 quite
heavily. For example, I found the following URLs in Google pretty
quickly by looking for percent-encoded, UTF-8-encoded accented e's.


http://www.last.fm/music/Jos%C3%A9+Gonz%C3%A1lez
http://en.wikipedia.org/wiki/Joseph_Fouch%C3%A9

http://apps.facebook.com/ilike/artist/Jos%C3%A9+Gonz%C3%A1lez/track/Stay+In+The+Shade?apv=1

While UTF-8 in URLs is not, strictly speaking, a standard, sites like
Last.fm, Facebook and Wikipedia have embraced it (as have pretty much
all other major web sites). As with HTML, there is what the standard
says and what actual browsers have to accept in order to work in the
real world.


urllib.urlencode already converts unicode characters to their UTF-8
representation before percent-encoding them. Why not urllib.quote and
urllib.quote_plus?
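
For illustration, here's a minimal sketch of the behavior I'm
suggesting (the wrapper name is mine, not a proposed API):

import urllib

def quote_unicode(s, safe='/'):
    # Sketch: UTF-8 encode unicode input before percent-encoding,
    # mirroring what urllib.urlencode already does for its values.
    if isinstance(s, unicode):
        s = s.encode('utf-8')
    return urllib.quote(s, safe)

print quote_unicode(u'Jos\xe9 Gonz\xe1lez')   # Jos%C3%A9%20Gonz%C3%A1lez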


Thanks for any thoughts on this,

Tom


Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Tom Pinckney
I may be missing something, but it seems that RFC 3987 (which is about
IRIs) basically says:

1) IRIs are identical to URIs except they may have unicode characters
in them
2) IRIs must be converted to URIs before being used in HTTP
3) The way to convert an IRI to a URI is to UTF-8 encode the unicode
characters in the IRI and then percent-encode the resulting octets
that are unsafe to have in a URI (see the sketch below)
4) There's some ambiguity over what to do with the hostname portion of
the URI if it has one (IDNA, replacing non-ASCII characters with
dashes, etc.)
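
For concreteness, a hand-rolled sketch of steps 1-3 (ignoring the
hostname question from point 4; the function name is mine):

import urllib

def iri_to_uri(iri):
    # Sketch of RFC 3987 section 3.1 minus the host part: UTF-8 encode
    # the unicode string, then percent-encode the unsafe octets.
    return urllib.quote(iri.encode('utf-8'), safe='/')

print iri_to_uri(u'/wiki/Joseph_Fouch\xe9')   # /wiki/Joseph_Fouch%C3%A9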


If this is indeed the case, it sounds perfectly legal (according to
the RFC) and perfectly practical (as required by numerous popular
websites) to have urllib.quote and urllib.quote_plus do an automatic
UTF-8 encoding of unicode strings before percent-encoding them.


It's not entirely clear to me whether people should be calling
urllib.quote on hostnames and expecting them to be encoded properly if
the hostname contains non-ASCII characters. Perhaps the docs should be
clarified on this matter?


Similarly, urllib.unquote should percent-decode characters and then
attempt to convert the resulting octets from UTF-8 to unicode. If that
conversion fails, we can assume the octets should be returned as a
byte string rather than a unicode string.
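
In rough code, something like this (a sketch; the fallback policy is
the point, not the exact name):

import urllib

def unquote_unicode(s):
    # Percent-decode, then try to interpret the octets as UTF-8;
    # fall back to the raw byte string if they aren't valid UTF-8.
    raw = urllib.unquote(s)
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw

print repr(unquote_unicode('Jos%C3%A9'))   # u'Jos\xe9'
print repr(unquote_unicode('Jos%E9'))      # 'Jos\xe9' (not UTF-8, left as bytes)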


On May 7, 2008, at 8:12 AM, Armin Ronacher wrote:


Hi,

Jeroen Ruigrok van der Werven  in-nomine.org> writes:


Would people object if such functionality got added to urllib?
I would ;-)  There are IRIs, just that nobody wrote a useful module
for that.  There are algorithms in the RFC that can convert URIs to
IRIs and the other way round.  IMO that's the way to go.

Regards,
Armin



Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Tom Pinckney
Maybe I didn't understand the RFC quite right, but it seemed like how
to handle hostnames was left as a choice between IDNA-encoding the
hostname or replacing the non-ASCII characters with dashes? I guess in
practice IDNA is the right decision.


Another part I wasn't clear on is whether urllib.quote() understands
it's working on URIs, arbitrary strings, URLs or what. From the
documentation, it looks like it's expected to work on just the path
component of URLs. If so, it doesn't need to understand what to do if
the IRI contains a hostname.


Seems like the other somewhat under-specified part of all of this is
how urllib.unquote() should work. If, after percent-decoding, it sees
non-ASCII octets, should it try to decode them as UTF-8 and, if that
fails, leave them as is?


On May 7, 2008, at 11:55 AM, Robert Brewer wrote:


"Martin v. Löwis" wrote:

The proper way to implement this would be IRIs (RFC 3987),
in particular section 3.1. This is not as simple as just
encoding it as UTF-8, as you might have to apply IDNA to
the host part.

Code doing so just hasn't been contributed yet.


But if someone wanted to do so, it's pretty simple:

>>> u'www.\u212bngstr\xf6m.com'.encode("idna")
'www.xn--ngstrm-hua5l.com'


Robert Brewer
[EMAIL PROTECTED]





Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Tom Pinckney
I was assuming urllib.quote/unquote would only be called on text
intended to be used in non-hostname portions of URIs. I'm not sure if
this is the actual intent of urllib.quote, and perhaps the
documentation should be updated to specify what precisely it does, so
that people can decide what parts of URIs it is appropriate to
quote/unquote. I don't believe quote/unquote does anything sensible
today with hostnames that contain non-ASCII characters, so this is no
loss of existing functionality.


Re your suggestion that IRIs should be a separate module: I guess my
thought is that urllib out of the box should just work with the way
websites actually work today. Thus, we should make urllib do the UTF-8
encoding and decoding rather than make users switch between one module
for certain URLs and another library for others.


Re the specific issue of how urllib.unquote should work: perhaps there
could be an optional second argument that specifies a content encoding
to use when decoding escaped characters? I would propose that this
parameter default to UTF-8, since that is what most websites seem to
use, but if the author knew that a particular website encoded its URLs
in ISO-8859-1, they could unquote using that scheme.
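
In other words, something with roughly this shape (a sketch of the
proposed signature, not working stdlib code):

import urllib

def unquote(s, encoding='utf-8'):
    # Sketch: percent-decode, then decode the resulting octets with
    # the caller-supplied charset; UTF-8 is the default since it's
    # what most sites appear to use.
    return urllib.unquote(s).decode(encoding)

print repr(unquote('Jos%C3%A9'))                      # u'Jos\xe9'
print repr(unquote('Jos%E9', encoding='iso-8859-1'))  # u'Jos\xe9'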


On May 7, 2008, at 3:10 PM, Martin v. Löwis wrote:

If this is indeed the case, it sounds perfectly legal (according to
the RFC) and perfectly practical (as required by numerous popular
websites) to have urllib.quote and urllib.quote_plus do an automatic
UTF-8 encoding of unicode strings before percent-encoding them.


It's probably legal, but I don't understand why you think it's
practical. The DNS lookup then will certainly fail, no?

Regards,
Martin




Re: [Python-Dev] Copying cgi.parse_qs() to the urllib.parse module

2008-05-12 Thread Tom Pinckney
Is there any thought to extending escape()/unescape() to handle, by
default, characters other than <, >, and &? At a minimum they should
handle arbitrary &#xxx; numeric character references. Ideally, they
would also handle the other common named entities besides &lt;, &gt;,
etc. (see the sketch below).


HTML from common web sites such as nytimes.com frequently has a  
variety of characters escaped.


Consider the page at 
http://travel.nytimes.com/travel/guides/europe/france/provence-and-the-french-riviera/overview.html

It lists its content type as:

content="text/html; charset=UTF-8"

And contains text like:

There&#146;s the C&ocirc;te d&#146;...

Ideally, we would decode &#146; into an apostrophe and &ocirc; into ô.
Unfortunately, &#146; is really an error -- it's not a valid unicode
character reference but really the MS codepage 1252 character for an
apostrophe (apparently many HTML editing systems intermingle unicode
and codepage 1252 content for apostrophes and a few other common
characters).
I'm happy to contribute some additional code for these other cases if  
people agree it's useful.
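
For concreteness, a rough sketch of the kind of unescape() I have in
mind, including a fixup for the codepage 1252 quirk (untested against
real pages; names are mine):

import re
import htmlentitydefs

def unescape(text):
    # Replace numeric (&#146;) and named (&ocirc;) character
    # references with the corresponding unicode characters; unknown
    # references are left alone.
    def fixup(m):
        ref = m.group(1)
        if ref.startswith('#'):
            num = ref[1:]
            if num.startswith(('x', 'X')):
                codepoint = int(num[1:], 16)
            else:
                codepoint = int(num)
            if 128 <= codepoint <= 159:
                # Not valid characters in HTML; reinterpret them as
                # MS codepage 1252 (the common real-world bug).
                return chr(codepoint).decode('cp1252')
            return unichr(codepoint)
        if ref in htmlentitydefs.name2codepoint:
            return unichr(htmlentitydefs.name2codepoint[ref])
        return m.group(0)   # unknown entity: leave it alone
    return re.sub(r'&(#?\w+);', fixup, text)

print unescape(u'There&#146;s the C&ocirc;te d&#146;Azur')
# u'There\u2019s the C\xf4te d\u2019Azur'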




On May 12, 2008, at 10:36 AM, Tony Nelson wrote:


At 11:56 PM -0400 5/10/08, Fred Drake wrote:

On May 10, 2008, at 11:49 PM, Guido van Rossum wrote:

Works for me. The other thing I always use from cgi is escape() --
will that be available somewhere else too?



xml.sax.saxutils.escape() would be an appropriate replacement, though
the location is a little funky.


At least it's right next to the valuable quoteattr().


Re: [Python-Dev] Addition of "pyprocessing" module to standard lib.

2008-05-13 Thread Tom Pinckney
Why not use MPI? It's cross-platform, cross-language and very widely
supported already. And there are already Python bindings.


On May 13, 2008, at 8:52 PM, Jesse Noller wrote:


I am looking for any questions, concerns or benchmarks python-dev has
regarding the possible inclusion of the pyprocessing module in the
standard library - preferably in the 2.6 timeline.  In March, I began
working on the PEP for the inclusion of the pyprocessing (processing)
module in the python standard library[1]. The original email to the
stdlib-sig can be found here; it includes a basic overview of the
module:

http://mail.python.org/pipermail/stdlib-sig/2008-March/000129.html

The processing module mirrors/mimics the API of the threading module -
and, with simple import/subclassing changes depending on the code,
allows you to leverage multi-core machines via an underlying forking
mechanism. The module also supports the sharing of data across groups
of networked machines - a feature obviously not part of the core
threading module, but useful in a distributed environment.

As I am trying to finish up the PEP, I want to see if I can address
any questions or include any other useful data (including benchmarks)
in the PEP prior to publishing it. I also intend to include basic
benchmarks comparing the processing module against the threading
module.

-Jesse

[1] Processing page: http://pyprocessing.berlios.de/


Re: [Python-Dev] Addition of "pyprocessing" module to standard lib.

2008-05-14 Thread Tom Pinckney


On May 14, 2008, at 12:32 PM, Andrew McNabb wrote:




Think of the processing module as an alternative to the threading
module, not as an alternative to MPI.  In Python, multi-threading can
be extremely slow.  The processing module gives you a way to convert
from using multiple threads to using multiple processes.

If it made people feel better, maybe it should be called threading2
instead of multiprocessing.  The word "processing" seems to make
people think of parallel processing and clusters, which is missing the
point.

Anyway, I would love to see the processing module included in the
standard library.



Is the goal of the pyprocessing module to be exactly drop-in
compatible with threading, Queue and friends? I guess the idea would
be that if my problem is compute-bound I'd use pyprocessing, and if it
is I/O-bound I might just use the existing threading library?


Can I write a program using only the threading and Queue interfaces
for inter-thread communication, change just my import statements, and
have my program work? Currently, it looks like the pyprocessing.Queue
interface is slightly different from Queue.Queue (no task_done(), for
example) -- see the sketch below.
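
The kind of switch I'm hoping for (a sketch, assuming, per its docs,
that processing exposes Process and Queue with threading-like
signatures):

# threading version:
from threading import Thread
from Queue import Queue

# pyprocessing version -- ideally the only change needed:
# from processing import Process as Thread, Queue

def worker(q):
    # Pull items until the None sentinel arrives.
    while True:
        item = q.get()
        if item is None:
            break
        print 'got', item

q = Queue()
t = Thread(target=worker, args=(q,))
t.start()
q.put('hello')
q.put(None)
t.join()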


Perhaps a stdlib version of pyprocessing could be simplified down to
not try to be a cross-machine computing environment and just be a
same-machine threading replacement? This would make the maintenance
easier and reduce confusion among users about how they should do
cross-machine multiprocessing.


By the way, a simple thread-safe dict in the style of Queue would be
extremely helpful in writing multi-threaded programs, whether using
the threading or pyprocessing modules.
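
Something like this minimal sketch (just a lock around every
operation; the class name is mine):

import threading

class SafeDict(object):
    # A minimal thread-safe dict in the spirit of Queue.Queue:
    # every operation holds an internal lock.
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def get(self, key, default=None):
        self._lock.acquire()
        try:
            return self._data.get(key, default)
        finally:
            self._lock.release()

    def put(self, key, value):
        self._lock.acquire()
        try:
            self._data[key] = value
        finally:
            self._lock.release()

d = SafeDict()
d.put('hits', 1)
print d.get('hits')   # 1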



[Python-Dev] Python parallel benchmark

2008-05-15 Thread Tom Pinckney
All the discussion recently about pyprocessing got me interested in
actually benchmarking Python's multiprocessing performance to see if
reality matched my expectations around what would scale up and what
would not. I knew Python threads wouldn't be good for compute-bound
problems, but I was curious to see how well they worked for I/O-bound
problems. The short answer is that for I/O-bound problems, Python
threads worked just as well as using multiple operating system
processes.


I wrote two simple benchmarks, one compute-bound and the other
I/O-bound. The compute-bound one did a parallel matrix multiply and
the I/O-bound one read random records from a remote MySQL database. I
ran each benchmark via Python's thread module and via MPI (using
mpi4py and openmpi, with Send()/Recv() for communication). Each test
was run multiple times and the numbers were consistent between test
runs. I ran the tests on a dual-core MacBook Pro running OS X 10.5 and
the included Python 2.5.1.
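
A minimal sketch of this kind of threaded compute-bound test (a
reconstruction, not the exact benchmark code):

import threading
import time

def matmul_rows(a, b, out, lo, hi):
    # Naive matrix multiply over the band of rows [lo, hi).
    n = len(b[0])
    m = len(b)
    for i in range(lo, hi):
        for j in range(n):
            out[i][j] = sum(a[i][k] * b[k][j] for k in range(m))

N = 200
NTHREADS = 2
a = [[1.0] * N for i in range(N)]
b = [[1.0] * N for i in range(N)]
out = [[0.0] * N for i in range(N)]

start = time.time()
band = N // NTHREADS
threads = [threading.Thread(target=matmul_rows,
                            args=(a, b, out, t * band, (t + 1) * band))
           for t in range(NTHREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print '%d threads: %.1f seconds' % (NTHREADS, time.time() - start)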


1) Python threads

a) compute-bound:

1 thread   -- 16 seconds
2 threads  -- 21 seconds

b) I/O-bound:

1 thread   -- 13 seconds
4 threads  -- 10 seconds
8 threads  -- 5 seconds
12 threads -- 4 seconds

2) MPI

a) compute-bound:

1 process   -- 17 seconds
2 processes -- 11 seconds

b) I/O-bound:

1 process    -- 13 seconds
4 processes  -- 10 seconds
8 processes  -- 6 seconds
12 processes -- 4 seconds


Re: [Python-Dev] Python parallel benchmark

2008-05-15 Thread Tom Pinckney
I switched to using numpy for the matrix multiply, and while the
multiply itself is much faster, there is still no speed-up from using
more than one Python thread. If I look at top while running two or
more threads, both cores are being used 100% and there is no idle time
on the system.
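
A minimal version of this experiment (a sketch) looks like:

import threading
import numpy

def mult():
    # Each thread does an independent BLAS-backed matrix multiply.
    a = numpy.empty((2000, 2000))
    b = numpy.empty((2000, 2000))
    numpy.dot(a, b)

threads = [threading.Thread(target=mult) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Watch top while this runs to see how many cores stay busy.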


I did a quick Google search and didn't find anything conclusive about
numpy releasing the GIL. The most conclusive and recent reference I
found was:


http://mail.python.org/pipermail/python-list/2007-October/463148.html

I found some other references where people were expressing concern
about numpy releasing the GIL, because other C extensions could call
numpy and unexpectedly have the GIL released out from under them (or
something like that).


On May 15, 2008, at 6:43 PM, Nick Coghlan wrote:


Tom Pinckney wrote:
All the discussion recently about pyprocessing got me interested in
actually benchmarking Python's multiprocessing performance to see if
reality matched my expectations around what would scale up and what
would not. I knew Python threads wouldn't be good for compute-bound
problems, but I was curious to see how well they worked for I/O-bound
problems. The short answer is that for I/O-bound problems, Python
threads worked just as well as using multiple operating system
processes.


Interesting - given that your example compute-bound problem happened
to be a matrix multiply, I'd be curious what the results are when
using Python threads with numpy to do the same thing (my understanding
is that numpy will usually release the GIL while doing serious
number-crunching).


Cheers,
Nick.

--
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---
   http://www.boredomandlaziness.org




Re: [Python-Dev] Python parallel benchmark

2008-05-15 Thread Tom Pinckney
Interestingly, I think there's something magical going on with
numpy.dot() on my Mac.


If I just run a program without threading -- that is, just a numpy
matrix multiply such as:


import numpy
a = numpy.empty((4000, 4000))   # uninitialized arrays; contents don't matter for timing
b = numpy.empty((4000, 4000))
c = numpy.dot(a, b)             # BLAS-backed matrix multiply

then I see both cores fully maxed out on my Mac. On a dual-core Linux
machine I see only one core maxed out by this program, and it runs
VASTLY slower on the Linux box.


It turns out that numpy on Macs uses Apple's Accelerate.framework BLAS
and LAPACK, which in turn are multi-threaded as of OS X 10.4.8.
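
A quick way to check which BLAS/LAPACK a given numpy build is linked
against (numpy exposes its build configuration):

import numpy
numpy.__config__.show()   # lists the BLAS/LAPACK libraries numpy was built with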


On May 15, 2008, at 10:55 PM, Greg Ewing wrote:


Tom Pinckney wrote:
If I look at top while running 2 or more threads, both cores are being
used 100% and there is no idle time on the system.

If you run it with just one thread, does it use up only one core's
worth of CPU?

If so, this suggests that the GIL is being released. If it wasn't, two
threads would still only use one core's worth.

Also, if you have only two cores, using more than two threads isn't
going to gain you anything whatever happens.

--
Greg


Re: [Python-Dev] Python parallel benchmark

2008-05-16 Thread Tom Pinckney

Here's one example, albeit from a few years ago

http://aspn.activestate.com/ASPN/Mail/Message/numpy-discussion/1625465

But I am a numpy novice, so I have no idea what it actually does in
its current form.


On May 16, 2008, at 4:17 AM, Hrvoje Nikšić wrote:


On Thu, 2008-05-15 at 21:02 -0400, Tom Pinckney wrote:

I found some other references where people were expressing concern
over numpy releasing the GIL due to the fact that other C extensions
could call numpy and unexpectedly have the GIL released on them (or
something like that).


Could you please post links to those?  I'm asking because AFAIK that
concern doesn't really stand.  Any (correct) code that releases the
GIL is responsible for reacquiring it before calling *any* Python
code, in fact before doing anything that might touch a Python object
or its refcount.

