Re: [Cython] Acquisition counted cdef classes

2011-10-25 Thread Stefan Behnel

mark florisson, 24.10.2011 21:50:

This is in response to
http://groups.google.com/group/cython-users/browse_thread/thread/bcbc5fe0e329224f
and http://trac.cython.org/cython_trac/ticket/498 , and some of the
previous discussion on cython.parallel.

Basically I think we should have something more powerful than 'cdef
borrowed CdefClass obj', something that also doesn't rely on new
syntax.


We will still need borrowed reference support in the compiler eventually, 
whether we make it a language feature or not.




What if we support acquisition counting for every instance of a cdef
class? In Python and Cython GIL mode you use reference counting, and
in Cython nogil mode and for struct attributes, array dtypes etc. you
use acquisition counting. This allows you to pass around cdef objects
without the GIL and use their nogil methods. If the acquisition count
is greater than 1, the acquisition count owns a reference to the
object. If it reaches 0 you discard your owned reference (you can
simply acquire the GIL if you don't have it) and when you increment
from zero you obtain it. Perhaps something like libatomic could be
used to efficiently implement this.
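
A rough model of the counting scheme described above, written in plain Python so it can be run as-is. Everything here is hypothetical: the class name AcquisitionCounted, the owned_reference flag (standing in for the single Python reference held by the count) and the lock (standing in for an atomic counter) are illustration only, not an existing Cython API. It reads the proposal as "the count owns one reference while it is non-zero".

```python
import threading


class AcquisitionCounted:
    """Toy model: an acquisition count that owns one 'real' reference while > 0."""

    def __init__(self):
        self._count = 0
        self._count_lock = threading.Lock()  # stand-in for an atomic counter
        self.owned_reference = False  # models the one reference held by the count

    def acquire(self):
        with self._count_lock:
            if self._count == 0:
                # Incrementing from zero: obtain the owned reference.
                self.owned_reference = True
            self._count += 1

    def release(self):
        with self._count_lock:
            if self._count == 0:
                raise RuntimeError("release() without matching acquire()")
            self._count -= 1
            if self._count == 0:
                # Reached zero: discard the owned reference
                # (real code would need to hold the GIL for this step).
                self.owned_reference = False

    @property
    def count(self):
        with self._count_lock:
            return self._count


def hammer(obj, rounds):
    """Acquire and release repeatedly, as a nogil worker might."""
    for _ in range(rounds):
        obj.acquire()
        obj.release()
```

As long as one holder keeps its acquisition, concurrent acquire/release pairs from other threads never drop the count to zero, so the owned reference is taken and dropped only once.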


Where would you store that count? In the object struct? That would increase 
the size of each instance.




The advantages are:

1) allow users to pass around cdef typed objects in nogil mode
2) allow cdef typed objects as struct attributes or array elements
3) make it easy to implement things like memoryviews (already done but
would have been a lot easier), cython.parallel.async/future objects,
cython.parallel.mutex objects and possibly other things in the future


Would it really be easier? You can already call cdef methods in nogil mode, 
AFAIR.




We should then allow a syntax like

    with mycdefobject:
        ...

to lock the object in GIL or nogil mode (like java's 'synchronized').
For objects that already have __enter__ and __exit__ you could support
something like 'with cython.synchronized(mycdefobject): ...' instead.
Or perhaps you should always require cython.synchronized (or
cython.parallel.synchronized).


The latter, I sure hope.
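
For illustration, the explicit form could behave roughly like the following plain-Python sketch. The names Lockable, synchronized, Counter and bump are hypothetical stand-ins; the per-object lock mirrors the lock the proposal wants to store in each instance, and nothing here is an existing cython.synchronized.

```python
import threading
from contextlib import contextmanager


class Lockable:
    """Hypothetical base class: every instance carries its own lock."""

    def __init__(self):
        self._lock = threading.RLock()  # re-entrant, like Java's 'synchronized'


@contextmanager
def synchronized(obj):
    """Hold obj's lock for the duration of the with block."""
    obj._lock.acquire()
    try:
        yield obj
    finally:
        obj._lock.release()


class Counter(Lockable):
    def __init__(self):
        super().__init__()
        self.value = 0


def bump(counter, rounds):
    """Increment the counter under its lock so no update is lost."""
    for _ in range(rounds):
        with synchronized(counter):
            counter.value += 1
```

Keeping the lock behind an explicit synchronized(...) call leaves __enter__ and __exit__ free for the class's own context manager protocol.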



In addition to nogil methods a user may provide special cdef nogil methods, i.e.

cdef int __len__(self) nogil:
    ...

which would provide a Cython as well as a Python implementation for
the function (with automatic cpdef behaviour), so you could use it in
both contexts.


That can already be done for final types, simply by adding cpdef behaviour 
to all special methods. That would also fix ticket #3, for example.


Note that the DefNode refactoring is still pending, it would help here.



There are two options for assignment semantics to a struct attribute
or array element:
 - decref the old value (this implies always initializing the
pointers to NULL first)
 - don't decref the old value (the user has to manually use 'del')

I think option 1) is definitely more consistent with how everything else works.


Yes.
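
A small sketch of the first option, assuming a toy counted object. Counted and Slot are hypothetical names used only for this illustration; the None initialisation plays the role of the NULL initialisation mentioned above.

```python
class Counted:
    """Toy object with a plain (non-atomic) acquisition count."""

    def __init__(self, name):
        self.name = name
        self.count = 0

    def acquire(self):
        self.count += 1

    def release(self):
        self.count -= 1


class Slot:
    """Models one struct attribute or array element holding a counted object."""

    def __init__(self):
        self._value = None  # 'NULL' initialisation, required by option 1

    def assign(self, new_value):
        old = self._value
        if new_value is not None:
            new_value.acquire()  # take the new reference first
        self._value = new_value
        if old is not None:
            old.release()  # then drop the old one

    def get(self):
        return self._value
```

Acquiring the new value before releasing the old one keeps self-assignment safe: the count never touches zero in between.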



All of this functionality should also get a sane C API (to be provided
by cython.h). You'd get a Cy_INCREF(obj, have_gil)/Cy_DECREF() etc.
Every class using this functionality is a subclass of CythonObject
(that contains a PyObject + an acquisition count + a lock). Perhaps if
the user is subclassing something other than object we could allow the
user to specify custom __cython_(un)lock__ and
__cython_acquisition_count__ methods and fields.

Now, building on top of this functionality, Cython could provide
built-in nogil-compatible types, like lists, dicts and maybe tuples
(as a start). These will by default not lock for operations to allow
e.g. one thread to iterate over the list and another thread to index
it without lock contention and other general overhead. If one thread
is somehow changing the size of the list, or writing to indices that
another thread is reading from/writing to, the results will of course
be undefined unless the user synchronizes on the object. So it would
be the user's responsibility. The acquisition counting itself will
always be thread-safe (i.e., it will be atomic if possible, otherwise
it will lock).

It's probably best to not enable this functionality by default as it
would be more expensive to instantiate objects, but it could be
supported through a cdef class decorator and a general directive.


It's well known that this would be expensive. One of the approaches that 
tried to get rid of the GIL in CPython introduced fine grained locking, and 
it turned out to be substantially slower, AFAIR by a factor of two.


You could potentially drop the locking for local variables, but you'd lose
that ability as soon as the 'object' is passed into a function.


Basically, what you are trying to do here is to duplicate the complete 
ref-counting infrastructure of CPython, but without using CPython.




Of course one may still use non-cdef borrowed objects, by simply
casting to a PyObject *.


That's very ugly, though, because you lose all access to methods and
attributes of

Re: [Cython] Acquisition counted cdef classes

2011-10-25 Thread mark florisson
On 25 October 2011 05:47, Robert Bradshaw wrote:
> On Mon, Oct 24, 2011 at 2:52 PM, mark florisson wrote:
>> On 24 October 2011 22:03, Greg Ewing wrote:
>>> mark florisson wrote:

>>>> These will by default not lock for operations to allow
>>>> e.g. one thread to iterate over the list and another thread to index
>>>> it without lock contention and other general overhead.
>>>
>>> I don't think that's safe. You can't say "I'm not modifying
>>> this, so I don't need to lock it" because there may be another
>>> thread that *is* in the midst of modifying it.
>>
>> I was really thinking of the case where you instantiate it in Cython
>> and then do some parallel work, in which case you're the only user.
>> But you can't assume that in general.
>
> It could be useful to assert for a chunk of code that a given object
> is read-only and will not be mutated for the duration of the context
> (programmer error and strange crash/data corruption if it is). E.g.
>
> with nogil, assert_frozen(my_dict):
>    a = (my_dict[key]).c_attribute
>    [...]
>
> All references obtained could be borrowed. Perhaps we could even
> enforce this for cdef classes (but perhaps not consistently enough,
> and perhaps that would make things even more confusing). Just a
> thought.
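
A debug-style sketch of the idea in plain Python. Note the difference: the suggestion above is an unchecked promise by the programmer, whereas this hypothetical assert_frozen helper verifies the promise after the block by comparing against a shallow snapshot. The helper is an illustration only, not an existing Cython feature.

```python
from contextlib import contextmanager


@contextmanager
def assert_frozen(mapping):
    """Check (after the fact) that the mapping was not mutated in the block."""
    snapshot = dict(mapping)  # shallow copy taken on entry
    yield mapping
    if dict(mapping) != snapshot:
        raise RuntimeError("mapping was mutated inside assert_frozen block")
```

A real implementation would skip the check in release builds, which is what would allow every reference obtained inside the block to be borrowed.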

Hmm, I actually think that passing around references in general
(without having to declare them as borrowed in parameters) would be a
good feature. If my_dict would be e.g. a cython.types.dict, then it
would only accept CythonObjects, so it could just do the acquisition
counting.

For cython.parallel we could provide types more suited for the
cython.parallel kind of fine-grained parallelism, e.g. lock for
writes, don't lock for reads, which allows either to happen
simultaneously, but not any mixing of those two. Through explicit or
implicit barriers one may be sure that operations are correct.
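
One way to picture the barrier-separated phases, sketched with Python's threading module. The function run_phases is a hypothetical illustration: writes to disjoint indices happen without locks in the first phase, and the reads in the second phase start only after every writer has passed the barrier.

```python
import threading


def run_phases(n_workers):
    data = [0] * n_workers
    totals = [0] * n_workers
    barrier = threading.Barrier(n_workers)

    def worker(i):
        data[i] = i + 1        # write phase: each thread writes its own index
        barrier.wait()         # no read starts until all writes are done
        totals[i] = sum(data)  # read phase: everyone reads, nobody writes data

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals
```

Because reads and writes never overlap, every worker sees the same complete data without taking a lock on the list.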

> - Robert
> ___
> cython-devel mailing list
> cython-devel@python.org
> http://mail.python.org/mailman/listinfo/cython-devel
>


Re: [Cython] Acquisition counted cdef classes

2011-10-25 Thread mark florisson
On 25 October 2011 08:33, Stefan Behnel wrote:
> mark florisson, 24.10.2011 21:50:
>>
>> This is in response to
>>
>> http://groups.google.com/group/cython-users/browse_thread/thread/bcbc5fe0e329224f
>> and http://trac.cython.org/cython_trac/ticket/498 , and some of the
>> previous discussion on cython.parallel.
>>
>> Basically I think we should have something more powerful than 'cdef
>> borrowed CdefClass obj', something that also doesn't rely on new
>> syntax.
>
> We will still need borrowed reference support in the compiler eventually,
> whether we make it a language feature or not.
>

I'm not sure I understand why. Acquisition counting could solve these
problems for cdef classes, and general objects may not be used without
the GIL. Do you want this as an optimization?

>> What if we support acquisition counting for every instance of a cdef
>> class? In Python and Cython GIL mode you use reference counting, and
>> in Cython nogil mode and for structs attributes, array dtypes etc you
>> use acquisition counting. This allows you to pass around cdef objects
>> without the GIL and use their nogil methods. If the acquisition count
>> is greater than 1, the acquisition count owns a reference to the
>> object. If it reaches 0 you discard your owned reference (you can
>> simply acquire the GIL if you don't have it) and when you increment
>> from zero you obtain it. Perhaps something like libatomic could be
>> used to efficiently implement this.
>
> Where would you store that count? In the object struct? That would increase
> the size of each instance.

Yes, not just the count, also the lock. This feature would be optional
and may be very useful for people (I think).

>
>> The advantages are:
>>
>> 1) allow users to pass around cdef typed objects in nogil mode
>> 2) allow cdef typed objects in as struct attributes or array elements
>> 3) make it easy to implement things like memoryviews (already done but
>> would have been a lot easier), cython.parallel.async/future objects,
>> cython.parallel.mutex objects and possibly other things in the future
>
> Would it really be easier? You can already call cdef methods in nogil mode,
> AFAIR.
>

Sure, but you cannot store cdef objects as struct attributes, array
elements (you could implement it with reference counting, but not for
nogil mode), and you cannot pass them around without the GIL. This
proposal is about making your life easier without the GIL, and
currently it's kind of a pain.

>> We should then allow a syntax like
>>
>>     with mycdefobject:
>>         ...
>>
>> to lock the object in GIL or nogil mode (like java's 'synchronized').
>> For objects that already have __enter__ and __exit__ you could support
>> something like 'with cython.synchronized(mycdefobject): ...' instead.
>> Or perhaps you should always require cython.synchronized (or
>> cython.parallel.synchronized).
>
> The latter, I sure hope.
>
>
>> In addition to nogil methods a user may provide special cdef nogil
>> methods, i.e.
>>
>> cdef int __len__(self) nogil:
>>     ...
>>
>> which would provide a Cython as well as a Python implementation for
>> the function (with automatic cpdef behaviour), so you could use it in
>> both contexts.
>
> That can already be done for final types, simply by adding cpdef behaviour
> to all special methods. That would also fix ticket #3, for example.
>
> Note that the DefNode refactoring is still pending, it would help here.
>

Ah I assumed cpdef nogil was invalid, I see it isn't, cool. This
breaks terribly for special methods though.

>> There are two options for assignment semantics to a struct attribute
>> or array element:
>>     - decref the old value (this implies always initializing the
>> pointers to NULL first)
>>     - don't decref the old value (the user has to manually use 'del')
>>
>> I think 1) is more definitely consistent with how everything else works.
>
> Yes.
>
>
>> All of this functionality should also get a sane C API (to be provided
>> by cython.h). You'd get a Cy_INCREF(obj, have_gil)/Cy_DECREF() etc.
>> Every class using this functionality is a subclass of CythonObject
>> (that contains a PyObject + an acquisition count + a lock). Perhaps if
>> the user is subclassing something other than object we could allow the
>> user to specify custom __cython_(un)lock__ and
>> __cython_acquisition_count__ methods and fields.
>>
>> Now, building on top of this functionality, Cython could provide
>> built-in nogil-compatible types, like lists, dicts and maybe tuples
>> (as a start). These will by default not lock for operations to allow
>> e.g. one thread to iterate over the list and another thread to index
>> it without lock contention and other general overhead. If one thread
>> is somehow changing the size of the list, or writing to indices that
>> another thread is reading from/writing to, the results will of course
>> be undefined unless the user synchronizes on the object. So it would
>> be the user's responsibility. The acquisition counting itsel

Re: [Cython] Acquisition counted cdef classes

2011-10-25 Thread Stefan Behnel

mark florisson, 25.10.2011 11:11:

On 25 October 2011 08:33, Stefan Behnel wrote:

mark florisson, 24.10.2011 21:50:


This is in response to

http://groups.google.com/group/cython-users/browse_thread/thread/bcbc5fe0e329224f
and http://trac.cython.org/cython_trac/ticket/498 , and some of the
previous discussion on cython.parallel.

Basically I think we should have something more powerful than 'cdef
borrowed CdefClass obj', something that also doesn't rely on new
syntax.


We will still need borrowed reference support in the compiler eventually,
whether we make it a language feature or not.


I'm not sure I understand why, acquisition counting could solve these
problems for cdef classes, and general objects may not be used without
the GIL. Do you want this as an optimization?


Yes. Think of type(x), for example, or PyDict_GetItem(). They return 
borrowed references, and in many cases, Cython wouldn't have to INCREF and 
DECREF them when they are only being used as part of some specific kinds of 
expressions. The same applies to some utility functions in Cython that 
currently must INCREF their return value unconditionally, simply because 
they can't tell Cython that they could also return a borrowed reference 
instead. If there was a way to do that, we could optimise the reference 
counting away in a couple of more places, which would get us another bit 
closer to hand-tuned code.


However, note that this doesn't necessarily have an impact on nogil code. 
If you took a borrowed reference in one nogil thread, and a gil-holding 
thread deletes the object at the same time or during the lifetime of the 
borrowed reference (e.g. by updating a dict or assigning to a cdef 
attribute), the nogil thread would end up with a dead pointer in its hands. 
That's why the usage of borrowed references needs to be explicit in the 
code ("I know what I'm doing"), and the optimisations require the GIL to be 
held.
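
The dead-pointer hazard can be imitated in plain Python with a weak reference, which, like a borrowed reference, does not keep its target alive. This is only an analogy: a weakref turns into None when the object dies, whereas a borrowed C pointer would simply dangle. The sketch assumes CPython's immediate reference-count-based deallocation; Node and borrow_then_delete are hypothetical names.

```python
import weakref


class Node:
    pass


def borrow_then_delete():
    registry = {"key": Node()}               # the dict owns the only strong reference
    borrowed = weakref.ref(registry["key"])  # does not keep the Node alive
    alive_before = borrowed() is not None
    del registry["key"]                      # the owner drops its reference
    alive_after = borrowed() is not None
    return alive_before, alive_after
```

In C the second lookup would dereference freed memory, which is why borrowing has to be an explicit choice made while the GIL is held.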




What if we support acquisition counting for every instance of a cdef
class? In Python and Cython GIL mode you use reference counting, and
in Cython nogil mode and for structs attributes, array dtypes etc you
use acquisition counting. This allows you to pass around cdef objects
without the GIL and use their nogil methods. If the acquisition count
is greater than 1, the acquisition count owns a reference to the
object. If it reaches 0 you discard your owned reference (you can
simply acquire the GIL if you don't have it) and when you increment
from zero you obtain it. Perhaps something like libatomic could be
used to efficiently implement this.


Where would you store that count? In the object struct? That would increase
the size of each instance.


Yes, not just the count, also the lock. This feature would be optional
and may be very useful for people (I think).


Well, as long as it's an optional feature that requires a class decorator, 
the only obvious drawback is that it'll bloat the compiler even more than 
it is already.




The advantages are:

1) allow users to pass around cdef typed objects in nogil mode
2) allow cdef typed objects in as struct attributes or array elements
3) make it easy to implement things like memoryviews (already done but
would have been a lot easier), cython.parallel.async/future objects,
cython.parallel.mutex objects and possibly other things in the future


Would it really be easier? You can already call cdef methods in nogil mode,
AFAIR.


Sure, but you cannot store cdef objects as struct attributes, array
elements (you could implement it with reference counting, but not for
nogil mode)


You could do that with borrowed references, though, assuming that you keep 
another reference around (or do your own ref-counting). However, I do see 
that keeping a real reference around may be hard to do in some cases.




and you cannot pass them around without the GIL.


Yes, you can, as long as you only go through cdef functions. Obviously, you 
can't pass them into a Python function call, but you can (and could, if it 
was implemented) do loads of useful things with existing references even in 
nogil sections. The GIL checker is quite fine grained already but could do 
even better.




This
proposal is about making your life easier without the GIL, and
currently it's kind of a pain.


The nogil sections I use are usually quite short, so I can't tell. It's 
certainly a pain to work without the GIL, because it means you have to take 
a lot more care when writing your code. But that won't change just by 
dropping reference counting. And nogil code will definitely become another 
bit harder to get right when using borrowed references.




Ah I assumed cpdef nogil was invalid, I see it isn't, cool.


It makes perfect sense. Just because a function *can* be called without the 
GIL doesn't mean it can't be called from Python. So the Python wrapper 
requires the GIL, but the underlying cdef function doesn't.




This breaks terribly for special methods though.


Why? It's just a matter of properl

Re: [Cython] Acquisition counted cdef classes

2011-10-25 Thread Dag Sverre Seljebotn

On 10/25/2011 09:33 AM, Stefan Behnel wrote:

mark florisson, 24.10.2011 21:50:

This is in response to
http://groups.google.com/group/cython-users/browse_thread/thread/bcbc5fe0e329224f
and http://trac.cython.org/cython_trac/ticket/498 , and some of the
previous discussion on cython.parallel.

Basically I think we should have something more powerful than 'cdef
borrowed CdefClass obj', something that also doesn't rely on new
syntax.


We will still need borrowed reference support in the compiler
eventually, whether we make it a language feature or not.



What if we support acquisition counting for every instance of a cdef
class? In Python and Cython GIL mode you use reference counting, and
in Cython nogil mode and for structs attributes, array dtypes etc you
use acquisition counting. This allows you to pass around cdef objects
without the GIL and use their nogil methods. If the acquisition count
is greater than 1, the acquisition count owns a reference to the
object. If it reaches 0 you discard your owned reference (you can
simply acquire the GIL if you don't have it) and when you increment
from zero you obtain it. Perhaps something like libatomic could be
used to efficiently implement this.


Where would you store that count? In the object struct? That would
increase the size of each instance.



The advantages are:

1) allow users to pass around cdef typed objects in nogil mode
2) allow cdef typed objects in as struct attributes or array elements
3) make it easy to implement things like memoryviews (already done but
would have been a lot easier), cython.parallel.async/future objects,
cython.parallel.mutex objects and possibly other things in the future


Would it really be easier? You can already call cdef methods in nogil
mode, AFAIR.



We should then allow a syntax like

    with mycdefobject:
        ...

to lock the object in GIL or nogil mode (like java's 'synchronized').
For objects that already have __enter__ and __exit__ you could support
something like 'with cython.synchronized(mycdefobject): ...' instead.
Or perhaps you should always require cython.synchronized (or
cython.parallel.synchronized).


The latter, I sure hope.



In addition to nogil methods a user may provide special cdef nogil
methods, i.e.

cdef int __len__(self) nogil:
    ...

which would provide a Cython as well as a Python implementation for
the function (with automatic cpdef behaviour), so you could use it in
both contexts.


That can already be done for final types, simply by adding cpdef
behaviour to all special methods. That would also fix ticket #3, for
example.

Note that the DefNode refactoring is still pending, it would help here.



There are two options for assignment semantics to a struct attribute
or array element:
- decref the old value (this implies always initializing the
pointers to NULL first)
- don't decref the old value (the user has to manually use 'del')

I think 1) is more definitely consistent with how everything else works.


Yes.



All of this functionality should also get a sane C API (to be provided
by cython.h). You'd get a Cy_INCREF(obj, have_gil)/Cy_DECREF() etc.
Every class using this functionality is a subclass of CythonObject
(that contains a PyObject + an acquisition count + a lock). Perhaps if
the user is subclassing something other than object we could allow the
user to specify custom __cython_(un)lock__ and
__cython_acquisition_count__ methods and fields.

Now, building on top of this functionality, Cython could provide
built-in nogil-compatible types, like lists, dicts and maybe tuples
(as a start). These will by default not lock for operations to allow
e.g. one thread to iterate over the list and another thread to index
it without lock contention and other general overhead. If one thread
is somehow changing the size of the list, or writing to indices that
another thread is reading from/writing to, the results will of course
be undefined unless the user synchronizes on the object. So it would
be the user's responsibility. The acquisition counting itself will
always be thread-safe (i.e., it will be atomic if possible, otherwise
it will lock).

It's probably best to not enable this functionality by default as it
would be more expensive to instantiate objects, but it could be
supported through a cdef class decorator and a general directive.


It's well known that this would be expensive. One of the approaches that
tried to get rid of the GIL in CPython introduced fine grained locking,
and it turned out to be substantially slower, AFAIR by a factor of two.


I'd gladly take a factor of two (or even four) slowdown of CPython code any
day to get rid of the GIL :-). The thing is, sometimes one has 48 cores
and considers a 10x speedup better than nothing...


Dag Sverre


Re: [Cython] Acquisition counted cdef classes

2011-10-25 Thread Stefan Behnel

Dag Sverre Seljebotn, 25.10.2011 15:28:

On 10/25/2011 09:33 AM, Stefan Behnel wrote:

mark florisson, 24.10.2011 21:50:

All of this functionality should also get a sane C API (to be provided
by cython.h). You'd get a Cy_INCREF(obj, have_gil)/Cy_DECREF() etc.
Every class using this functionality is a subclass of CythonObject
(that contains a PyObject + an acquisition count + a lock). Perhaps if
the user is subclassing something other than object we could allow the
user to specify custom __cython_(un)lock__ and
__cython_acquisition_count__ methods and fields.

Now, building on top of this functionality, Cython could provide
built-in nogil-compatible types, like lists, dicts and maybe tuples
(as a start). These will by default not lock for operations to allow
e.g. one thread to iterate over the list and another thread to index
it without lock contention and other general overhead. If one thread
is somehow changing the size of the list, or writing to indices that
another thread is reading from/writing to, the results will of course
be undefined unless the user synchronizes on the object. So it would
be the user's responsibility. The acquisition counting itself will
always be thread-safe (i.e., it will be atomic if possible, otherwise
it will lock).

It's probably best to not enable this functionality by default as it
would be more expensive to instantiate objects, but it could be
supported through a cdef class decorator and a general directive.


It's well known that this would be expensive. One of the approaches that
tried to get rid of the GIL in CPython introduced fine grained locking,
and it turned out to be substantially slower, AFAIR by a factor of two.


I'd gladly take a factor two (or even four) slowdown of CPython code any
day to get rid of the GIL :-). The thing is, sometimes one has 48 cores and
consider a 10x speedup better than nothing...


Ah, sorry, that factor was for single-threaded code. How it would scale for 
multi-core code depends on too many factors to make any general statement.
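
As back-of-the-envelope arithmetic only: if one assumes perfect scaling across cores (exactly the assumption that cannot be made in general), the best case relative to the original single-threaded code is the core count divided by the single-threaded slowdown. The helper ideal_speedup is a hypothetical name for this illustration.

```python
def ideal_speedup(cores, single_thread_slowdown):
    """Upper bound on speedup versus the original single-threaded code,
    assuming perfect scaling across cores (rarely true in practice)."""
    return cores / single_thread_slowdown
```

With 48 cores, a factor-of-two slowdown caps the gain at 24x and a factor of four at 12x; contention and memory bandwidth would push real numbers lower.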


Stefan


Re: [Cython] Acquisition counted cdef classes

2011-10-25 Thread mark florisson
On 25 October 2011 12:22, Stefan Behnel wrote:
> mark florisson, 25.10.2011 11:11:
>>
>> On 25 October 2011 08:33, Stefan Behnel wrote:
>>>
>>> mark florisson, 24.10.2011 21:50:

>>>> This is in response to
>>>>
>>>> http://groups.google.com/group/cython-users/browse_thread/thread/bcbc5fe0e329224f
>>>> and http://trac.cython.org/cython_trac/ticket/498 , and some of the
>>>> previous discussion on cython.parallel.
>>>>
>>>> Basically I think we should have something more powerful than 'cdef
>>>> borrowed CdefClass obj', something that also doesn't rely on new
>>>> syntax.
>>>
>>> We will still need borrowed reference support in the compiler eventually,
>>> whether we make it a language feature or not.
>>
>> I'm not sure I understand why, acquisition counting could solve these
>> problems for cdef classes, and general objects may not be used without
>> the GIL. Do you want this as an optimization?
>
> Yes. Think of type(x), for example, or PyDict_GetItem(). They return
> borrowed references, and in many cases, Cython wouldn't have to INCREF and
> DECREF them when they are only being used as part of some specific kinds of
> expressions. The same applies to some utility functions in Cython that
> currently must INCREF their return value unconditionally, simply because
> they can't tell Cython that they could also return a borrowed reference
> instead. If there was a way to do that, we could optimise the reference
> counting away in a couple of more places, which would get us another bit
> closer to hand-tuned code.
>
> However, note that this doesn't necessarily have an impact on nogil code. If
> you took a borrowed reference in one nogil thread, and a gil-holding thread
> deletes the object at the same time or during the lifetime of the borrowed
> reference (e.g. by updating a dict or assigning to a cdef attribute), the
> nogil thread would end up with a dead pointer in its hands. That's why the
> usage of borrowed references needs to be explicit in the code ("I know what
> I'm doing"), and the optimisations require the GIL to be held.
>

I see, ok. Thanks, that really helped me see the motivation behind it
(i.e., the INC/DECREF really is a performance issue for you).

>>>> What if we support acquisition counting for every instance of a cdef
>>>> class? In Python and Cython GIL mode you use reference counting, and
>>>> in Cython nogil mode and for structs attributes, array dtypes etc you
>>>> use acquisition counting. This allows you to pass around cdef objects
>>>> without the GIL and use their nogil methods. If the acquisition count
>>>> is greater than 1, the acquisition count owns a reference to the
>>>> object. If it reaches 0 you discard your owned reference (you can
>>>> simply acquire the GIL if you don't have it) and when you increment
>>>> from zero you obtain it. Perhaps something like libatomic could be
>>>> used to efficiently implement this.
>>>
>>> Where would you store that count? In the object struct? That would
>>> increase
>>> the size of each instance.
>>
>> Yes, not just the count, also the lock. This feature would be optional
>> and may be very useful for people (I think).
>
> Well, as long as it's an optional feature that requires a class decorator,
> the only obvious drawback is that it'll bloat the compiler even more than it
> is already.
>

Actually, I think it will help the implementation of mutexes and async
objects if we want those, and possibly other stuff in the future. The
acquisition counting is basically already there (for memoryviews), so
it's easy to track down where and when to apply this. However one
major problem would be circular acquisition counts, so you'd also have
to implement a garbage collector like CPython has (e.g. if you have a
cdef class with a cython.parallel.dict). We should just have a real
garbage collector instead of all the counting crap. Or we could make
it a burden for the user...
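
The cycle problem is easy to reproduce with CPython's own reference counting, which is the same situation circular acquisition counts would create. In this sketch (Holder and cycle_survives_refcounting are hypothetical names) the automatic collector is switched off for the duration so that only counting is at work until gc.collect() is called explicitly.

```python
import gc
import weakref


class Holder:
    def __init__(self):
        self.other = None


def cycle_survives_refcounting():
    was_enabled = gc.isenabled()
    gc.disable()  # keep the cyclic collector from running on its own
    try:
        a, b = Holder(), Holder()
        a.other, b.other = b, a   # a <-> b reference cycle
        probe = weakref.ref(a)
        del a, b                  # no outside references remain
        alive_after_del = probe() is not None
        gc.collect()              # the cycle detector reclaims both objects
        alive_after_gc = probe() is not None
        return alive_after_del, alive_after_gc
    finally:
        if was_enabled:
            gc.enable()
```

Counting alone leaves the pair alive after the last outside reference is gone; only the traversal-based collector frees it.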

I agree that this is really not as feasible as I first thought. It
actually shows me a problem where I can have a memoryview object in a
memoryview with dtype 'object', although the problem here is that the
memoryview object doesn't traverse the object in the Py_buffer, or
when coerced from a memoryview slice to a memoryview object, the
memoryview slice struct object... I suppose I need to fix that (but
I'm not sure how, as you can't provide a manual traverse function in
Cython).

But I really believe that these are much-wanted features. If you're
using threads in Python you can only get concurrency, not parallelism,
unless you release the GIL. Even if there is some performance overhead,
it will still be a lot better than sequential execution. Perhaps when
cython.parallel is more mature, we may get functionality to
specify data distribution schemes and message passing, in which case
the GIL won't be a problem. But many things would be harder or much
more expensive, e.g. transposing, sending objects etc.

I think I'll just drop this discussion for now. I'm going 

Re: [Cython] Acquisition counted cdef classes

2011-10-25 Thread Stefan Behnel

mark florisson, 25.10.2011 18:58:

On 25 October 2011 12:22, Stefan Behnel wrote:

mark florisson, 25.10.2011 11:11:

On 25 October 2011 08:33, Stefan Behnel wrote:

mark florisson, 24.10.2011 21:50:

What if we support acquisition counting for every instance of a cdef
class? In Python and Cython GIL mode you use reference counting, and
in Cython nogil mode and for structs attributes, array dtypes etc you
use acquisition counting. This allows you to pass around cdef objects
without the GIL and use their nogil methods. If the acquisition count
is greater than 1, the acquisition count owns a reference to the
object. If it reaches 0 you discard your owned reference (you can
simply acquire the GIL if you don't have it) and when you increment
from zero you obtain it. Perhaps something like libatomic could be
used to efficiently implement this.


Where would you store that count? In the object struct? That would
increase the size of each instance.


Yes, not just the count, also the lock. This feature would be optional
and may be very useful for people (I think).


Well, as long as it's an optional feature that requires a class decorator,
the only obvious drawback is that it'll bloat the compiler even more than it
is already.


Actually, I think it will help the implementation of mutexes and async
objects if we want those, and possibly other stuff in the future.


If all you want is to support the regular with statement in nogil blocks, 
part of that is implemented already. I recently added support for 
implementing the context manager's __enter__() method as c(p)def method. 
However, __exit__() isn't there yet, as it's a bit more tricky - maybe 
taking off a C pointer to the cdef method and calling that, or calling the 
cdef method directly instead (not sure), but always making sure that there 
still is a reference to the context manager itself, and eventually freeing 
it. I'm sure it can be done, though, maybe with some restrictions in nogil 
mode. If we additionally fix it up to use the exception propagation and 
try-finally support that you wrote for the with-gil feature, we're 
basically there.
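
For reference, this is roughly what the with statement has to expand to, written out by hand in plain Python. The helper run_with and the Recorder class are hypothetical names for this sketch, and it simplifies the real protocol (which looks up __enter__ and __exit__ on the type). The local variable manager is what keeps the context manager referenced until __exit__ has run.

```python
import sys


def run_with(manager, body):
    """Hand-expanded 'with manager as value: body(value)'."""
    value = manager.__enter__()
    try:
        result = body(value)
    except BaseException:
        # Give __exit__ the exception; re-raise unless it asks to suppress it.
        if not manager.__exit__(*sys.exc_info()):
            raise
        return None
    else:
        manager.__exit__(None, None, None)
        return result


class Recorder:
    """Minimal context manager that logs when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self

    def __exit__(self, exc_type, exc, tb):
        self.events.append("exit")
        return False  # do not suppress exceptions
```

The try/except/else shape is the part that needs exception propagation support when the block runs without the GIL.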




The
acquisition counting is basically already there (for memoryviews), so
it's easy to track down where and when to apply this. However one
major problem would be circular acquisition counts, so you'd also have
to implement a garbage collector like CPython has (e.g. if you have a
cdef class with a cython.parallel.dict). We should just have a real
garbage collector instead of all the counting crap. Or we could make
it a burden for the user...


Right, these things can grow endlessly. It took CPython something like a 
dozen years to a) recognise the need for and b) implement a garbage 
collector. Let's hope that Cython will never get one.




I agree that this is really not as feasible as I first thought. It
actually shows me a problem where I can have a memoryview object in a
memoryview with dtype 'object', although the problem here is that the
memoryview object doesn't traverse the object in the Py_buffer, or
when coerced from a memoryview slice to a memoryview object, the
memoryview slice struct object... I suppose I need to fix that (but
I'm not sure how, as you can't provide a manual traverse function in
Cython).


No, you may have to descend into C here. Or, you could disable a Python 
object dtype for the time being?




But I really believe that these are much-wanted features. If you're
using threads in Python you can only get concurrency not parallelism,
unless you release the GIL, even if there is some performance overhead
it will still be a lot better than sequential execution. Perhaps when
cython.parallel will be more mature, we may get functionality to
specify data distribution schemes and message passing, in which case
the GIL won't be a problem. But many things would be harder or much
more expensive, e.g. transposing, sending objects etc.


See? That's what I mean with language complexity. These things quickly turn 
into an open can of worms. I don't think the language should handle any of 
these. Message passing is up to libraries, for example. If you want 
language support, use Erlang.




The advantages are:

1) allow users to pass around cdef typed objects in nogil mode
2) allow cdef typed objects in as struct attributes or array elements
3) make it easy to implement things like memoryviews (already done but
would have been a lot easier), cython.parallel.async/future objects,
cython.parallel.mutex objects and possibly other things in the future


Would it really be easier? You can already call cdef methods in nogil
mode, AFAIR.


Sure, but you cannot store cdef objects as struct attributes, array
elements (you could implement it with reference counting, but not for
nogil mode)


You could do that with borrowed references, though, assuming that you keep
another reference around (or do your own ref-counting). However, I do see
that keeping a real reference around may be hard to do in some cases.


Re: [Cython] Acquisition counted cdef classes

2011-10-25 Thread mark florisson
On 25 October 2011 19:10, Stefan Behnel  wrote:
> mark florisson, 25.10.2011 18:58:
>>
>> On 25 October 2011 12:22, Stefan Behnel wrote:
>>>
>>> mark florisson, 25.10.2011 11:11:

>>>> On 25 October 2011 08:33, Stefan Behnel wrote:
>>>>>
>>>>> mark florisson, 24.10.2011 21:50:
>>>>>>
>>>>>> What if we support acquisition counting for every instance of a cdef
>>>>>> class? In Python and Cython GIL mode you use reference counting, and
>>>>>> in Cython nogil mode and for structs attributes, array dtypes etc you
>>>>>> use acquisition counting. This allows you to pass around cdef objects
>>>>>> without the GIL and use their nogil methods. If the acquisition count
>>>>>> is greater than 1, the acquisition count owns a reference to the
>>>>>> object. If it reaches 0 you discard your owned reference (you can
>>>>>> simply acquire the GIL if you don't have it) and when you increment
>>>>>> from zero you obtain it. Perhaps something like libatomic could be
>>>>>> used to efficiently implement this.
>>>>>
>>>>> Where would you store that count? In the object struct? That would
>>>>> increase the size of each instance.
>>>>
>>>> Yes, not just the count, also the lock. This feature would be optional
>>>> and may be very useful for people (I think).
>>>
>>> Well, as long as it's an optional feature that requires a class decorator,
>>> the only obvious drawback is that it'll bloat the compiler even more than
>>> it is already.
>>
>> Actually, I think it will help the implementation of mutexes and async
>> objects if we want those, and possibly other stuff in the future.
>
> If all you want is to support the regular with statement in nogil blocks,
> part of that is implemented already. I recently added support for
> implementing the context manager's __enter__() method as c(p)def method.
> However, __exit__() isn't there yet, as it's a bit more tricky - maybe
> taking off a C pointer to the cdef method and calling that, or calling the
> cdef method directly instead (not sure), but always making sure that there
> still is a reference to the context manager itself, and eventually freeing
> it. I'm sure it can be done, though, maybe with some restrictions in nogil
> mode. If we additionally fix it up to use the exception propagation and
> try-finally support that you wrote for the with-gil feature, we're basically
> there.
>

Cool. I suppose if you combine that with borrowed references you may
just get somewhere implementing the mutexes. On the other hand it
won't really be more convenient than passing OpenMP or Python locks
around, just slightly more pythonic.

>> The
>> acquisition counting is basically already there (for memoryviews), so
>> it's easy to track down where and when to apply this. However one
>> major problem would be circular acquisition counts, so you'd also have
>> to implement a garbage collector like CPython has (e.g. if you have a
>> cdef class with a cython.parallel.dict). We should just have a real
>> garbage collector instead of all the counting crap. Or we could make
>> it a burden for the user...
>
> Right, these things can grow endlessly. It took CPython something like a
> dozen years to a) recognise the need for and b) implement a garbage
> collector. Let's hope that Cython will never get one.
>
>
>> I agree that this is really not as feasible as I first thought. It
>> actually shows me a problem where I can have a memoryview object in a
>> memoryview with dtype 'object', although the problem here is that the
>> memoryview object doesn't traverse the object in the Py_buffer, or
>> when coerced from a memoryview slice to a memoryview object, the
>> memoryview slice struct object... I suppose I need to fix that (but
>> I'm not sure how, as you can't provide a manual traverse function in
>> Cython).
>
> No, you may have to descend into C here. Or, you could disable a Python
> object dtype for the time being?
>

Yes disabling would be easy, but it should be fixed (at some point).
Perhaps I can just override the tp_traverse of the type object in the
module init function (and maybe save that pointer and call it from the
new function + traverse the Py_buffer).

I'm not entirely sure how we support Py_buffer, but it is a built-in
thing and it doesn't result in a traverse:

    cdef class X(object):
        cdef Py_buffer view

<- this won't have a traverse function. Fixing that won't get me there
though, I need to do the same thing for memoryview objects wrapping a
memoryview struct.

>> But I really believe that these are much-wanted features. If you're
>> using threads in Python you can only get concurrency not parallelism,
>> unless you release the GIL, even if there is some performance overhead
>> it will still be a lot better than sequential execution. Perhaps when
>> cython.parallel will be more mature, we may get functionality to
>> specify data distribution schemes and message passing, in which case
>> the GIL won't be a problem. But many things would be harder or much
>> more expensive, e.g. 

Re: [Cython] Acquisition counted cdef classes

2011-10-25 Thread Dag Sverre Seljebotn

On 10/25/2011 06:58 PM, mark florisson wrote:

On 25 October 2011 12:22, Stefan Behnel  wrote:

The problem is not so much the INCREF (which is just an indirect add), it's
the DECREF, which contains a conditional jump based on an unknown external
value, that may trigger external code. That can kill several C compiler
optimisations for the surrounding code. (And that would only get worse by
using a dedicated locking mechanism.)


What you could do is a form of pseudo-garbage-collection where, when the 
Cython refcount/acquisition count reaches 0, you enqueue a Python DECREF 
until you're holding the GIL anyway. If sticking it into the queue is 
unlikely(), and it is transparent to the compiler that it doesn't 
dispatch into unknown code.


(And regarding Stefan's comment about Erlang: It's all about available 
libraries. A language for concurrent computing running on CPython and 
able to use all the libraries available for CPython would be awesome. It 
doesn't need to be named Cython -- show me an Erlang port to the CPython 
platform and I'd perhaps jump ship.)




Anyway, sorry for the long mail. I agree this is likely not feasible
to implement, although I would like the functionality to be there.
Perhaps I'm trying to solve problems which don't really need to be
solved. Maybe we should just use multiprocessing, or MPI and numpy
with global arrays and pickling. Maybe memoryviews could help out with
that as well.


Nice conclusion. I think prange was a very nice 80%-there-solution 
(which is also the way we framed it when starting), but the GIL just 
creates too many barriers. Real garbage collection is needed, and CPython 
just isn't there.


What I'd like to see personally is:

 - A convenient utility to allocate an array in shared memory, so that 
when you pickle a view of it and send it to another Python process with 
multiprocessing and it unpickles, it gets a slice into the same 
shared memory. People already do this but it's just a lot of jumping 
through hoops. A good place would probably be in NumPy.


 - Decent message passing using ZeroMQ in Cython code without any 
Python overhead, for fine-grained communication in Cython code in Python 
processes spawned using multiprocessing. I think this requires some 
syntax candy in Cython to feel natural enough, but perhaps it can be put 
on a form so that it is not ZeroMQ-specific.
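
The first wish above (a view that re-attaches to the same shared memory when unpickled) can be sketched with the standard library alone. This is a minimal illustration, assuming Python 3.8 or later for `multiprocessing.shared_memory`; the class `SharedDoubles` and its methods are hypothetical names for this example, not an existing NumPy or Cython API.

```python
import pickle
from multiprocessing import shared_memory


class SharedDoubles:
    """Hypothetical array of C doubles backed by a named shared-memory block."""

    def __init__(self, length, name=None):
        self.length = length
        nbytes = length * 8  # 8 bytes per C double
        # Create a new block, or attach to an existing one when a name is given.
        self._shm = shared_memory.SharedMemory(
            name=name, create=name is None, size=nbytes
        )
        # The OS may round the block up to a page size, so slice before casting.
        self._bytes = self._shm.buf[:nbytes]
        self.view = self._bytes.cast("d")

    def __reduce__(self):
        # Pickle only the length and block name; unpickling re-attaches
        # to the same memory instead of copying the data.
        return (SharedDoubles, (self.length, self._shm.name))

    def close(self, unlink=False):
        # All views must be released before the block can be closed.
        self.view.release()
        self._bytes.release()
        self._shm.close()
        if unlink:
            self._shm.unlink()
```

An instance passed to a `multiprocessing` worker is pickled on the way, so the worker ends up with a view onto the same block. The creating process should call `close(unlink=True)` once every other process has closed its copy.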


Dag Sverre
___
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel


Re: [Cython] Acquisition counted cdef classes

2011-10-25 Thread Dag Sverre Seljebotn

On 10/25/2011 09:01 PM, Dag Sverre Seljebotn wrote:

On 10/25/2011 06:58 PM, mark florisson wrote:

On 25 October 2011 12:22, Stefan Behnel wrote:

The problem is not so much the INCREF (which is just an indirect add), it's
the DECREF, which contains a conditional jump based on an unknown external
value, that may trigger external code. That can kill several C compiler
optimisations for the surrounding code. (And that would only get worse by
using a dedicated locking mechanism.)


What you could do is a form of pseudo-garbage-collection where, when the
Cython refcount/acquisition count reaches 0, you enqueue a Python DECREF
until you're holding the GIL anyway. If sticking it into the queue is
unlikely(), and it is transparent to the compiler that it doesn't
dispatch into unknown code.


...then the C compiler optimizations should presumably not be killed.

DS



(And regarding Stefan's comment about Erlang: It's all about available
libraries. A language for concurrent computing running on CPython and
able to use all the libraries available for CPython would be awesome. It
doesn't need to be named Cython -- show me an Erlang port to the CPython
platform and I'd perhaps jump ship.)



Anyway, sorry for the long mail. I agree this is likely not feasible
to implement, although I would like the functionality to be there.
Perhaps I'm trying to solve problems which don't really need to be
solved. Maybe we should just use multiprocessing, or MPI and numpy
with global arrays and pickling. Maybe memoryviews could help out with
that as well.


Nice conclusion. I think prange was a very nice 80%-there-solution
(which is also the way we framed it when starting), but the GIL just
creates too many barriers. Real garbage collection is needed, and CPython
just isn't there.

What I'd like to see personally is:

- A convenient utility to allocate an array in shared memory, so that
when you pickle a view of it and send it to another Python process with
multiprocessing and it unpickles, it gets a slice into the same
shared memory. People already do this but it's just a lot of jumping
through hoops. A good place would probably be in NumPy.

- Decent message passing using ZeroMQ in Cython code without any Python
overhead, for fine-grained communication in Cython code in Python
processes spawned using multiprocessing. I think this requires some
syntax candy in Cython to feel natural enough, but perhaps it can be put
on a form so that it is not ZeroMQ-specific.

Dag Sverre




Re: [Cython] Acquisition counted cdef classes

2011-10-25 Thread Dag Sverre Seljebotn

On 10/25/2011 08:45 PM, mark florisson wrote:

On 25 October 2011 19:10, Stefan Behnel  wrote:

See? That's what I mean with language complexity. These things quickly turn
into an open can of worms. I don't think the language should handle any of
these. Message passing is up to libraries, for example. If you want language
support, use Erlang.



I haven't used Erlang (though I should give it a go), but I find that
built-in support for these things just ends up to be much more
elegant. MPI (and possibly zeromq) just look terrible and complicated
if you compare them to Unified Parallel C, High Performance Fortran or


Using libraries for message passing is sort of like doing complex string 
manipulation only using malloc, free, and string.h :-)



Co-Array Fortran. I don't know about Go channels. This doesn't mean
that we should support it, but we might consider it.


I think you should definitely read up on Go channels, they're just like 
what I'd like to write in Cython.
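
For readers who have not seen the channel style, here is a rough Python analogue using threads and `queue.Queue`. It is only an approximation: a Go unbuffered channel is a rendezvous, whereas this sketch uses a bounded queue of capacity 1, and the names `producer`, `consume`, `run` and `_CLOSED` are made up for the example.

```python
import queue
import threading

_CLOSED = object()  # sentinel marking the end of the stream


def producer(channel, items):
    # Send every item, then signal that no more will follow.
    for item in items:
        channel.put(item)  # blocks while the channel is full
    channel.put(_CLOSED)


def consume(channel):
    # Receive until the sentinel arrives, squaring each item.
    results = []
    while True:
        item = channel.get()
        if item is _CLOSED:
            return results
        results.append(item * item)


def run(items, capacity=1):
    channel = queue.Queue(maxsize=capacity)
    worker = threading.Thread(target=producer, args=(channel, items))
    worker.start()
    results = consume(channel)
    worker.join()
    return results
```

The sender and receiver share nothing but the channel, which is the property that makes this style attractive for code that wants to avoid shared reference counts.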


Dag Sverre


Re: [Cython] Acquisition counted cdef classes

2011-10-25 Thread mark florisson
On 25 October 2011 20:01, Dag Sverre Seljebotn
 wrote:
> On 10/25/2011 06:58 PM, mark florisson wrote:
>>
>> On 25 October 2011 12:22, Stefan Behnel  wrote:
>>>
>>> The problem is not so much the INCREF (which is just an indirect add),
>>> it's the DECREF, which contains a conditional jump based on an unknown
>>> external value, that may trigger external code. That can kill several C
>>> compiler optimisations for the surrounding code. (And that would only get
>>> worse by using a dedicated locking mechanism.)
>
> What you could do is a form of pseudo-garbage-collection where, when the
> Cython refcount/acquisition count reaches 0, you enqueue a Python DECREF
> until you're holding the GIL anyway. If sticking it into the queue is
> unlikely(), and it is transparent to the compiler that it doesn't dispatch
> into unknown code.

I thought about that as well, but the problem is that you can only
defer the DECREF to a garbage collector if your acquisition count
reaches zero and your reference count is one. However, you may reach
an acquisition count of zero with a reference count > 1, which means
you could have the following race:

1) acquisition count reaches zero, a DECREF is pending in the
garbage collector thread
2) you obtain a nonzero acquisition count from the object (e.g. by
assigning a non-typed to a typed variable)
3) you lose your acquisition count again, another DECREF should be pending
4) the garbage collector figures out it needs to DECREF (it should
actually do this twice)

Now, you could keep a counter for how many times that happens, but
that will likely not be better than an immediate DECREF. In short,
reference counting is terrible. I think unlikely() will help the
compiler here as you said though, and your processor will have branch
prediction, out of order execution and conditional instructions which
may all help.
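
The "keep a counter" variant mentioned above can be sketched as follows. This is a simulation in plain Python, assuming a hypothetical `DeferredDecrefs` helper; the lock stands in for whatever synchronisation a real implementation would use, and `decref` is a callback supplied by the caller.

```python
import threading
from collections import Counter


class DeferredDecrefs:
    """Hypothetical queue of DECREFs postponed until the GIL is held."""

    def __init__(self):
        self._pending = Counter()  # object key -> number of owed DECREFs
        self._lock = threading.Lock()

    def defer(self, key):
        # Called without the GIL when an acquisition count drops to zero.
        with self._lock:
            self._pending[key] += 1

    def drain(self, decref):
        # Called while holding the GIL: apply every owed DECREF exactly once.
        with self._lock:
            owed, self._pending = self._pending, Counter()
        for key, times in owed.items():
            for _ in range(times):
                decref(key)
```

Because each drop to zero is counted rather than merely flagged, the race in steps 1 to 4 ends with two DECREFs being applied, at the cost of extra bookkeeping on every drop to zero.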

> (And regarding Stefan's comment about Erlang: It's all about available
> libraries. A language for concurrent computing running on CPython and able
> to use all the libraries available for CPython would be awesome. It doesn't
> need to be named Cython -- show me an Erlang port to the CPython platform
> and I'd perhaps jump ship.)
>
>
>> Anyway, sorry for the long mail. I agree this is likely not feasible
>> to implement, although I would like the functionality to be there.
>> Perhaps I'm trying to solve problems which don't really need to be
>> solved. Maybe we should just use multiprocessing, or MPI and numpy
>> with global arrays and pickling. Maybe memoryviews could help out with
>> that as well.
>
> Nice conclusion. I think prange was a very nice 80%-there-solution (which is
> also the way we framed it when starting), but the GIL just creates too many
> barriers. Real garbage collection is needed, and CPython just isn't there.
>
> What I'd like to see personally is:
>
>  - A convenient utility to allocate an array in shared memory, so that when
> you pickle a view of it and send it to another Python process with
> multiprocessing and it unpickles, it gets a slice into the same shared
> memory. People already do this but it's just a lot of jumping through hoops.
> A good place would probably be in NumPy.

I haven't used it myself, but can the global array support help in that regard?

>  - Decent message passing using ZeroMQ in Cython code without any Python
> overhead, for fine-grained communication in Cython code in Python processes
> spawned using multiprocessing. I think this requires some syntax candy in
> Cython to feel natural enough, but perhaps it can be put on a form so that
> it is not ZeroMQ-specific.
>
> Dag Sverre


Re: [Cython] Acquisition counted cdef classes

2011-10-25 Thread mark florisson
On 25 October 2011 20:15, Dag Sverre Seljebotn
 wrote:
> On 10/25/2011 08:45 PM, mark florisson wrote:
>>
>> On 25 October 2011 19:10, Stefan Behnel  wrote:
>>>
>>> See? That's what I mean with language complexity. These things quickly
>>> turn into an open can of worms. I don't think the language should handle
>>> any of these. Message passing is up to libraries, for example. If you
>>> want language support, use Erlang.
>>>
>>
>> I haven't used Erlang (though I should give it a go), but I find that
>> built-in support for these things just ends up to be much more
>> elegant. MPI (and possibly zeromq) just look terrible and complicated
>> if you compare them to Unified Parallel C, High Performance Fortran or
>
> Using libraries for message passing is sort of like doing complex string
> manipulation only using malloc, free, and string.h :-)
>
>> Co-Array Fortran. I don't know about Go channels. This doesn't mean
>> that we should support it, but we might consider it.
>
> I think you should definitely read up on Go channels, they're just like what
> I'd like to write in Cython.

That's a good motivator :) I'll do that.

> Dag Sverre


Re: [Cython] Acquisition counted cdef classes

2011-10-25 Thread Greg Ewing

Dag Sverre Seljebotn wrote:

I'd gladly take a factor two (or even four) slowdown of CPython code any 
day to get rid of the GIL :-). The thing is, sometimes one has 48 cores 
and considers a 10x speedup better than nothing...


Another thing to consider is that locking around refcount
changes may not be as expensive in typical Cython code as
it is in Python.

The trouble with Python is that you can't so much as scratch
your nose without touching a big pile of ref counts. But
if the Cython code is only dealing with a few Python objects
and doing most of its work at the C level, the relative
overhead of locking around refcount changes may not be
significant.

So it may be worth trying the strategy of just acquiring
the GIL whenever a refcount needs to be changed in a nogil
section, and damn the consequences.

--
Greg