Re: [Cython] OpenMP support

2011-03-10 Thread Robert Bradshaw
On Tue, Mar 8, 2011 at 11:16 AM, Francesc Alted  wrote:
> On Tuesday 08 March 2011 18:50:15 Stefan Behnel wrote:
>> mark florisson, 08.03.2011 18:00:
>> > What I meant was that the
>> > wrapper returned by the decorator would have to call the closure
>> > for every iteration, which introduces function call overhead.
>> >
>> >[...]
>> >
>> > I guess we just have to establish what we want to do: do we
>> > want to support code with Python objects (and exceptions etc), or
>> > just C code written in Cython?
>>
>> I like the approach that Sturla mentioned: using closures to
>> implement worker threads. I think that's very pythonic. You could do
>> something like this, for example:
>>
>>      def worker():
>>          for item in queue:
>>              with nogil:
>>                  do_stuff(item)
>>
>>      queue.extend(work_items)
>>      start_threads(worker, count)
>>
>> Note that the queue is only needed to tell the thread what to work
>> on. A lot of things can be shared over the closure. So the queue may
>> not even be required in many cases.
>
> I like this approach too.  I suppose that you will need to annotate the
> items so that they are not Python objects, no?  Something like:
>
>     def worker():
>         cdef int item  # tell that item is not a Python object!
>         for item in queue:
>             with nogil:
>                 do_stuff(item)
>
>     queue.extend(work_items)
>     start_threads(worker, count)

On a slightly higher level, are we just trying to use OpenMP from
Cython, or are we trying to build it into the language? If the former,
it may make sense to keep the API closer to the underlying C than one
might otherwise be tempted to, so that we can leverage the existing
documentation. A library with a more Pythonic interface could perhaps
be written on top of that. Alternatively, if we're building it into
Cython itself, I'd say it might be worth modeling it after the
multiprocessing module (though I understand it would be implemented
with threads), which I think is a decent enough model for managing
embarrassingly parallel operations. The above code is similar to that,
though I'd prefer the for loop to be implicit rather than part of the
worker method (or at least as an argument). If we went this route,
what are the advantages of using OpenMP over, say, pthreads in the
background? (And could the latter be done with just a library + some
fancy GIL specifications?) One thing that's nice about OpenMP as
implemented in C is that the serial code looks almost exactly like the
parallel code; the code at http://wiki.cython.org/enhancements/openmp
has this property too.
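
For reference, the multiprocessing-style model mentioned above can be sketched in plain Python with the thread-backed pool from the standard library. This is only an illustration of the "implicit loop" shape, not a proposal for Cython syntax; do_stuff and work_items are stand-ins borrowed from the quoted examples, and the pool size of 4 is arbitrary.

```python
from multiprocessing.pool import ThreadPool


def do_stuff(item):
    # Hypothetical stand-in for the per-item work in the quoted examples.
    return item * item


work_items = list(range(8))

# Serial version: an ordinary loop over the items.
serial = [do_stuff(item) for item in work_items]

# Parallel version: the loop is implicit in map(), and the worker
# function only ever sees a single item.
with ThreadPool(4) as pool:
    parallel = pool.map(do_stuff, work_items)
```

As with OpenMP in C, the serial and parallel versions differ by only a line or two, and map() returns the results in input order.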

Also, I like the idea of the invoking thread holding the GIL and the
"sharing" threads doing the appropriate locking among themselves when
needed, if possible, e.g. for exception raising.

Another thought I had: there might be other use cases for being able
to emit generic pragma statements. How far would that get us?

- Robert
___
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel


Re: [Cython] OpenMP support

2011-03-10 Thread Stefan Behnel

Robert Bradshaw, 11.03.2011 01:46:
> On Tue, Mar 8, 2011 at 11:16 AM, Francesc Alted wrote:
> > On Tuesday 08 March 2011 18:50:15 Stefan Behnel wrote:
> >> mark florisson, 08.03.2011 18:00:
> >> > What I meant was that the
> >> > wrapper returned by the decorator would have to call the closure
> >> > for every iteration, which introduces function call overhead.
> >> >
> >> >[...]
> >> >
> >> > I guess we just have to establish what we want to do: do we
> >> > want to support code with Python objects (and exceptions etc), or
> >> > just C code written in Cython?


> >> I like the approach that Sturla mentioned: using closures to
> >> implement worker threads. I think that's very pythonic. You could do
> >> something like this, for example:
> >>
> >>      def worker():
> >>          for item in queue:
> >>              with nogil:
> >>                  do_stuff(item)
> >>
> >>      queue.extend(work_items)
> >>      start_threads(worker, count)
> >>
> >> Note that the queue is only needed to tell the thread what to work
> >> on. A lot of things can be shared over the closure. So the queue may
> >> not even be required in many cases.


> > I like this approach too.  I suppose that you will need to annotate the
> > items so that they are not Python objects, no?  Something like:
> >
> >     def worker():
> >         cdef int item  # tell that item is not a Python object!
> >         for item in queue:
> >             with nogil:
> >                 do_stuff(item)
> >
> >     queue.extend(work_items)
> >     start_threads(worker, count)


> On a slightly higher level, are we just trying to use OpenMP from
> Cython, or are we trying to build it into the language? If the former,
> it may make sense to keep the API closer to the underlying C than one
> might otherwise be tempted to, so that we can leverage the existing
> documentation. A library with a more Pythonic interface could perhaps
> be written on top of that. Alternatively, if we're building it into
> Cython itself, I'd say it might be worth modeling it after the
> multiprocessing module (though I understand it would be implemented
> with threads), which I think is a decent enough model for managing
> embarrassingly parallel operations.


+1



> The above code is similar to that,
> though I'd prefer the for loop to be implicit rather than part of the
> worker method (or at least as an argument).


It provides a simple way to write per-thread initialisation code, though. 
And it's likely easier to make looping fast than to speed up the call into 
a closure. However, eventually, both ways will need to be supported anyway.




> If we went this route,
> what are the advantages of using OpenMP over, say, pthreads in the
> background? (And could the latter be done with just a library + some
> fancy GIL specifications?)


In the above example, basically everything is explicit and nothing more 
than a simplified threading setup is needed. Even the implementation of 
"start_threads()" could be done in a couple of lines of Python code, 
including the collection of results and errors. If someone thinks we need 
more than that, I'd like to see a couple of concrete use cases and code 
examples first.
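
To make the "couple of lines" claim concrete, here is a minimal pure-Python sketch of what start_threads() might look like. The return value (a pair of result and error lists) is an assumption, since the example above does not say how results are handed back. The worker also pops from a collections.deque instead of iterating over the queue, because a plain "for item in queue" in several threads would not divide the items between them. do_stuff is a hypothetical stand-in.

```python
import threading
from collections import deque


def start_threads(worker, count):
    """Run worker() in `count` threads and return (results, errors)."""
    results, errors = [], []

    def run():
        try:
            results.append(worker())  # list.append is atomic under the GIL
        except Exception as exc:
            errors.append(exc)

    threads = [threading.Thread(target=run) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, errors


queue = deque()


def do_stuff(item):
    # Hypothetical per-item work.
    return item * 2


def worker():
    done = []
    while True:
        try:
            item = queue.popleft()  # thread-safe; IndexError when empty
        except IndexError:
            return done
        done.append(do_stuff(item))


queue.extend(range(10))
results, errors = start_threads(worker, 3)
```

Each thread returns the list of items it processed, so results holds one list per thread; which thread gets which item is not deterministic.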




> One thing that's nice about OpenMP as
> implemented in C is that the serial code looks almost exactly like the
> parallel code; the code at http://wiki.cython.org/enhancements/openmp
> has this property too.


Writing it with a closure isn't really that much different. You can put the 
inner function right where it would normally get executed and add a bit of 
calling/load distributing code below it. Not that bad IMO.


It may be worth providing some ready-to-use decorators to do the load 
balancing, but I don't really like the idea of having a decorator magically 
invoke the function in-place that it decorates.
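
A ready-to-use decorator of that kind could look roughly like the sketch below. The name distribute and its signature are made up for illustration. It uses simple static striding rather than real load balancing, and it does not collect exceptions raised in the threads. The point is only that decorating the function runs nothing; the work starts when the decorated function is called explicitly.

```python
import threading


def distribute(count):
    """Hypothetical decorator: turn a per-item function into a callable
    that spreads a sequence of items over `count` threads when called."""
    def decorate(func):
        def run(items):
            items = list(items)
            results = [None] * len(items)

            def work(start):
                # Static striding: thread k handles items k, k+count, ...
                for i in range(start, len(items), count):
                    results[i] = func(items[i])

            threads = [threading.Thread(target=work, args=(k,))
                       for k in range(count)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
            return results
        return run
    return decorate


@distribute(4)
def square(x):
    return x * x


# Nothing has run so far; this call is the explicit invocation.
squares = square(range(10))
```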




> Also, I like the idea of the invoking thread holding the GIL and the
> "sharing" threads doing the appropriate locking among themselves when
> needed, if possible, e.g. for exception raising.


I like the explicit "with nogil" block in my example above. It makes it 
easy to use normal Python setup code, to synchronise based on the GIL if 
desired (e.g. to use a normal Python queue for communication), and it's 
simple enough not to get in the way.
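
As a rough pure-Python illustration of that setup, the sketch below uses a normal queue.Queue for communication and ordinary Python code for per-thread setup. In Cython, only the do_stuff() call would sit inside the "with nogil" block; do_stuff, the None sentinel and the thread count are assumptions made for the example.

```python
import queue
import threading

tasks = queue.Queue()
totals = queue.Queue()


def do_stuff(item):
    # Hypothetical stand-in for the GIL-free work.
    return item + 1


def worker():
    local_total = 0              # per-thread setup, runs with the GIL held
    while True:
        item = tasks.get()       # the Python queue does the synchronisation
        if item is None:         # sentinel: no more work for this thread
            break
        local_total += do_stuff(item)
    totals.put(local_total)


count = 3
threads = [threading.Thread(target=worker) for _ in range(count)]
for t in threads:
    t.start()
for item in range(10):
    tasks.put(item)
for _ in threads:
    tasks.put(None)              # one sentinel per thread
for t in threads:
    t.join()

grand_total = sum(totals.get() for _ in range(count))
```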


I think it simplifies things a lot when code can rely on the GIL being held 
when entering the thread function. Threading is complicated enough that it 
is worth keeping it as explicit as possible.


Stefan