Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Dag Sverre Seljebotn

On 04/13/2012 01:38 AM, Robert Bradshaw wrote:

On Thu, Apr 12, 2012 at 3:34 PM, Dag Sverre Seljebotn
  wrote:

On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:


Travis Oliphant recently raised the issue on the NumPy list of what
mechanisms to use to box native functions produced by his Numba so that
SciPy functions can call it, e.g. (I'm making the numba part up):

@numba # Compiles function using LLVM
def f(x):
return 3 * x

print scipy.integrate.quad(f, 1, 2) # do many callbacks natively!

Obviously, we want something standard, so that Cython functions can also
be called in a fast way.

This is very similar to CEP 523
(http://wiki.cython.org/enhancements/nativecall), but rather than
Cython-to-Cython, we want something that both SciPy, NumPy, numba,
Cython, f2py, fwrap can implement.

Here's my proposal; Travis seems happy to implement something like it
for numba and parts of SciPy:

http://wiki.cython.org/enhancements/nativecall



I'm sorry. HERE is the CEP:

http://wiki.cython.org/enhancements/cep1000

Since writing that yesterday, I've moved more in the direction of wanting a
zero-terminated list of overloads instead of providing a count, and have the
fast protocol jump over the header (since version is available elsewhere),
and just demand that the structure is sizeof(void*)-aligned in the first
place rather than the complicated padding.
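For concreteness, a zero-terminated overload table along these lines might look like the following C sketch. All names here (`overload_t`, `find_overload`, the "d)d" signature notation for double -> double) are illustrative guesses at the shape, not the CEP's actual definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Each overload pairs a native entry point with a signature string;
   the table ends with a NULL funcptr instead of carrying a count. */
typedef struct {
    void *funcptr;          /* NULL terminates the list */
    const char *signature;  /* e.g. "d)d" for double -> double */
} overload_t;

/* Scan the zero-terminated table for an exact signature match. */
static void *find_overload(const overload_t *table, const char *sig)
{
    for (; table->funcptr != NULL; table++)
        if (strcmp(table->signature, sig) == 0)
            return table->funcptr;
    return NULL;
}

/* Example native implementation, as in the numba example above. */
static double triple(double x) { return 3.0 * x; }

static const overload_t table[] = {
    { (void *)triple, "d)d" },
    { NULL, NULL },
};
```

A caller wanting the double->double entry point would do `find_overload(table, "d)d")` and cast the result to the matching function pointer type.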


Great idea to coordinate with the many other projects here. Eventually
this could maybe even be a PEP.

Somewhat related, I'd like to add support for Go-style interfaces.
These would essentially be vtables of pre-fetched function pointers,
and could play very nicely with this interface.


Yep; but you agree that this can be done in isolation without 
considering vtables first?



Have you given any thought as to what happens if __call__ is
re-assigned for an object (or subclass of an object) supporting this
interface? Or is this out of scope?


Out-of-scope, I'd say. Though you can always write an object that 
detects if you assign to __call__...



Minor nit: I don't think should_dereference is worth branching on, if
one wants to save the allocation one can still use a variable-sized
type and point to oneself. Yes, that's an extra dereference, but the
memory is already likely close and it greatly simplifies the logic.
But I could be wrong here.


Those minor nits are exactly what I seek; since Travis will have the 
first implementation in numba<->SciPy, I just want to make sure that 
what he does will work efficiently with Cython.


Can we perhaps just require that the information is embedded in the object?

I must admit that when I wrote that I was mostly thinking of JIT-style 
code generation, where you only use should_dereference for 
code-generation. But yes, by converting the table to a C structure you 
can do without a JIT.




Also, I'm not sure the type registration will scale, especially if
every callable type wanted to get registered. (E.g. currently closures
and generators are new types...) Where to draw the line? (Perhaps
things could get registered lazily on the first __nativecall__ lookup,
as they're likely to be looked up again?)


Right... if we do some work to synchronize the types for Cython modules 
generated by the same version of Cython, we're left with 3-4 types for 
Cython, right? Then a couple for numba and one for f2py; so on the order 
of 10?


An alternative is to do something funny in the type object to get across 
the offset-in-object information (abusing the docstring, or introduce 
our own flag which means that the type object has an additional 
non-standard field at the end).


Dag
___
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Dag Sverre Seljebotn

On 04/13/2012 07:24 AM, Stefan Behnel wrote:

Dag Sverre Seljebotn, 13.04.2012 00:34:

On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:

Travis Oliphant recently raised the issue on the NumPy list of what
mechanisms to use to box native functions produced by his Numba so that
SciPy functions can call it, e.g. (I'm making the numba part up):

@numba # Compiles function using LLVM
def f(x):
return 3 * x

print scipy.integrate.quad(f, 1, 2) # do many callbacks natively!

Obviously, we want something standard, so that Cython functions can also
be called in a fast way.

This is very similar to CEP 523
(http://wiki.cython.org/enhancements/nativecall), but rather than
Cython-to-Cython, we want something that both SciPy, NumPy, numba,
Cython, f2py, fwrap can implement.

Here's my proposal; Travis seems happy to implement something like it
for numba and parts of SciPy:

http://wiki.cython.org/enhancements/nativecall


I'm sorry. HERE is the CEP:

http://wiki.cython.org/enhancements/cep1000


Some general remarks:

I'm all for doing something in this direction and have been hinting at it
on the PyPy mailing list for a while, without reaction so far. I'll trigger
them again, with a pointer to this discussion and the CEP. PyPy should be
totally interested in a generic way to do fast calls into wrapped C code in
general and Cython implemented functions specifically. Their JIT would then
look at the function at runtime and unwrap it.

There's PEP 362 which proposes a Signature object. It seems to have
attracted some interest lately and Guido seems to like it also. I think we
should come up with a way to add a C level interface to that, instead of
designing something entirely separate.

http://www.python.org/dev/peps/pep-0362/


Well, provided that you still want an efficient representation that can 
be strcmp-ed in dispatch code, this seems to boil down to using a 
Signature object rather than a capsule (with a C interface), storing it 
in __signature__ rather than __fastcall__, and perhaps providing a slot 
in the type object for a function returning it.


I really think the right approach is to prove the concept outside of the 
standardization process first; a) by the time a PEP would be accepted it 
will have been years since Travis had time to work on this, b) as far as 
the slot in the type object goes, we're left with users on Python 2.4 
today; a Python 3.4+ solution is not really a solution.


Dag


[Cython] PyPy sprint in Leipzig, June 22-27 (was: Re: CEP1000: Native dispatch through callables)

2012-04-13 Thread Stefan Behnel
Stefan Behnel, 13.04.2012 07:24:
> Dag Sverre Seljebotn, 13.04.2012 00:34:
>> http://wiki.cython.org/enhancements/cep1000
> 
> I'm all for doing something in this direction and have been hinting at it
> on the PyPy mailing list for a while, without reaction so far. I'll trigger
> them again, with a pointer to this discussion and the CEP. PyPy should be
> totally interested in a generic way to do fast calls into wrapped C code in
> general and Cython implemented functions specifically. Their JIT would then
> look at the function at runtime and unwrap it.

BTW, there will be a PyPy sprint in Leipzig from June 22-27. If anyone's
interested in coordinating with PyPy on this and other topics, that might
be a good place to go for a day or two.

http://permalink.gmane.org/gmane.comp.python.pypy/9896

Stefan


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Robert Bradshaw
On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn
 wrote:
> On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
>>
>> On Thu, Apr 12, 2012 at 3:34 PM, Dag Sverre Seljebotn
>>   wrote:
>>>
>>> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:


 Travis Oliphant recently raised the issue on the NumPy list of what
 mechanisms to use to box native functions produced by his Numba so that
 SciPy functions can call it, e.g. (I'm making the numba part up):

 @numba # Compiles function using LLVM
 def f(x):
 return 3 * x

 print scipy.integrate.quad(f, 1, 2) # do many callbacks natively!

 Obviously, we want something standard, so that Cython functions can also
 be called in a fast way.

 This is very similar to CEP 523
 (http://wiki.cython.org/enhancements/nativecall), but rather than
 Cython-to-Cython, we want something that both SciPy, NumPy, numba,
 Cython, f2py, fwrap can implement.

 Here's my proposal; Travis seems happy to implement something like it
 for numba and parts of SciPy:

 http://wiki.cython.org/enhancements/nativecall
>>>
>>>
>>>
>>> I'm sorry. HERE is the CEP:
>>>
>>> http://wiki.cython.org/enhancements/cep1000
>>>
>>> Since writing that yesterday, I've moved more in the direction of wanting
>>> a
>>> zero-terminated list of overloads instead of providing a count, and have
>>> the
>>> fast protocol jump over the header (since version is available
>>> elsewhere),
>>> and just demand that the structure is sizeof(void*)-aligned in the first
>>> place rather than the complicated padding.
>>
>>
>> Great idea to coordinate with the many other projects here. Eventually
>> this could maybe even be a PEP.
>>
>> Somewhat related, I'd like to add support for Go-style interfaces.
>> These would essentially be vtables of pre-fetched function pointers,
>> and could play very nicely with this interface.
>
>
> Yep; but you agree that this can be done in isolation without considering
> vtables first?

Yes, for sure.

>> Have you given any thought as to what happens if __call__ is
>> re-assigned for an object (or subclass of an object) supporting this
>> interface? Or is this out of scope?
>
>
> Out-of-scope, I'd say. Though you can always write an object that detects if
> you assign to __call__...
>
>
>> Minor nit: I don't think should_dereference is worth branching on, if
>> one wants to save the allocation one can still use a variable-sized
>> type and point to oneself. Yes, that's an extra dereference, but the
>> memory is already likely close and it greatly simplifies the logic.
>> But I could be wrong here.
>
>
> Those minor nits are exactly what I seek; since Travis will have the first
> implementation in numba<->SciPy, I just want to make sure that what he does
> will work efficiently with Cython.

+1

I have to admit building/invoking these var-arg-sized __nativecall__
records seems painful. Here's another suggestion:

struct {
    void* pointer;
    size_t signature;     // compressed binary representation, 95% coverage
    char* long_signature; // used if signature is not representable in a
                          // size_t, as indicated by signature = 0
} record;

These char* could optionally be allocated at the end of the record*
for optimal locality. We could even dispense with the binary
signature, but having that option allows us to avoid strcmp for stuff
like d)d and ffi)f.
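To make the proposal concrete, here is a minimal compilable sketch of how dispatch on such a record might branch on the size_t first and fall back to strcmp only for long signatures. The names (`record_t`, `compress_sig`, `record_matches`) and the packing scheme are invented for illustration, not part of the CEP:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the proposed record: a compressed signature packed into a
   size_t covers the common cases; signature == 0 means "consult
   long_signature instead". */
typedef struct {
    void *pointer;
    size_t signature;
    const char *long_signature;
} record_t;

/* Toy compression: pack up to sizeof(size_t) signature bytes into the
   integer; return 0 if the string does not fit. */
static size_t compress_sig(const char *sig)
{
    size_t len = strlen(sig), packed = 0;
    if (len == 0 || len > sizeof(size_t))
        return 0;
    memcpy(&packed, sig, len);
    return packed;
}

/* Match a record against a wanted signature: one integer compare in
   the common case, strcmp only when both sides are "long". */
static int record_matches(const record_t *r, const char *wanted)
{
    size_t packed = compress_sig(wanted);
    if (packed != 0)
        return r->signature == packed;
    return r->signature == 0 &&
           strcmp(r->long_signature, wanted) == 0;
}
```

Under this sketch, short signatures like "d)d" never touch `long_signature` at all, which is the point of the optional binary field.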

> Can we perhaps just require that the information is embedded in the object?

I think not, this would require variably-sized objects (and also use
up the variable-sized nature). Given that this is in a portion of the
program that is iterating over a Python tuple, I think the extra
dereference here is inconsequential.

> I must admit that when I wrote that I was mostly thinking of JIT-style code
> generation, where you only use should_dereference for code-generation. But
> yes, by converting the table to a C structure you can do without a JIT.
>
>
>>
>> Also, I'm not sure the type registration will scale, especially if
>> every callable type wanted to get registered. (E.g. currently closures
>> and generators are new types...) Where to draw the line? (Perhaps
>> things could get registered lazily on the first __nativecall__ lookup,
>> as they're likely to be looked up again?)
>
>
> Right... if we do some work to synchronize the types for Cython modules
> generated by the same version of Cython, we're left with 3-4 types for
> Cython, right? Then a couple for numba and one for f2py; so on the order of
> 10?

No, I think each closure is its own type.

> An alternative is to do something funny in the type object to get across the
> offset-in-object information (abusing the docstring, or introduce our own
> flag which means that the type object has an additional non-standard field
> at the end).

It's a hack, but the flag + non-standard field idea might just work...
Ah, don't you just love C :)

- Robert

Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Stefan Behnel
Dag Sverre Seljebotn, 13.04.2012 11:13:
> On 04/13/2012 07:24 AM, Stefan Behnel wrote:
>> Dag Sverre Seljebotn, 13.04.2012 00:34:
>>> http://wiki.cython.org/enhancements/cep1000
>>
>> There's PEP 362 which proposes a Signature object. It seems to have
>> attracted some interest lately and Guido seems to like it also. I think we
>> should come up with a way to add a C level interface to that, instead of
>> designing something entirely separate.
>>
>> http://www.python.org/dev/peps/pep-0362/
> 
> Well, provided that you still want an efficient representation that can be
> strcmp-ed in dispatch codes, this seems to boil down to using a Signature
> object rather than a capsule (with a C interface), and store it in
> __signature__ rather than __fastcall__, and perhaps provide a slot in the
> type object for a function returning it.

Basically, yes. I was just bringing it up because we should keep it in mind
when designing a solution. Moving it into the Signature object would also
allow C signature introspection from Python code, for example. It would
obviously need a straight C level way to access it.

I'm not sure it has to be a function, though. I would prefer a simple array
of structs that map signature strings to function pointers. Like the
PyMethodDef struct.


> I really think the right approach is to prove the concept outside of the
> standardization process first; a) by the time a PEP would be accepted it
> will have been years since Travis had time to work on this, b) as far as
> the slot in the type object goes, we're left with users on Python 2.4
> today; a Python 3.4+ solution is not really a solution.

Sure. But nothing keeps us from backporting at least parts of it to older
Pythons, like we did for so many other things.

Stefan


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Stefan Behnel
Robert Bradshaw, 13.04.2012 12:17:
> On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote:
>> On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
>>> Have you given any thought as to what happens if __call__ is
>>> re-assigned for an object (or subclass of an object) supporting this
>>> interface? Or is this out of scope?
>>
>> Out-of-scope, I'd say. Though you can always write an object that detects if
>> you assign to __call__...

+1 for out of scope. This is a pure C level feature.


>>> Minor nit: I don't think should_dereference is worth branching on, if
>>> one wants to save the allocation one can still use a variable-sized
>>> type and point to oneself. Yes, that's an extra dereference, but the
>>> memory is already likely close and it greatly simplifies the logic.
>>> But I could be wrong here.
>>
>>
>> Those minor nits are exactly what I seek; since Travis will have the first
>> implementation in numba<->SciPy, I just want to make sure that what he does
>> will work efficiently with Cython.
> 
> +1
> 
> I have to admit building/invoking these var-arg-sized __nativecall__
> records seems painful. Here's another suggestion:
> 
> struct {
> void* pointer;
> size_t signature; // compressed binary representation, 95% coverage
> char* long_signature; // used if signature is not representable in
> a size_t, as indicated by signature = 0
> } record;
> 
> These char* could optionally be allocated at the end of the record*
> for optimal locality. We could even dispense with the binary
> signature, but having that option allows us to avoid strcmp for stuff
> like d)d and ffi)f.

Assuming we use literals and a const char* for the signature, the C
compiler would cut down the number of signature strings automatically for
us. And a pointer comparison is the same as a size_t comparison.

That would only apply at a per-module level, though, so it would require an
indirection for the signature IDs. But it would avoid a global registry.

Another idea would be to set the signature ID field to 0 at the beginning
and call a C-API function to let the current runtime assign an ID > 0,
unique for the currently running application. Then every user would only
have to parse the signature once to adapt to the respective ID and could
otherwise branch based on it directly.

For Cython, we could generate a static ID variable for each typed call that
we found in the sources. When encountering a C signature on a callable,
either a) the ID variable is still empty (initial case), then we parse the
signature to see if it matches the expected signature. If it does, we
assign the corresponding ID to the static ID variable and issue a direct
call. If b) the ID field is already set (normal case), we compare the
signature IDs directly and issue a C call if they match. If the IDs do not
match, we issue a normal Python call.
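The runtime-assigned-ID scheme described above can be sketched in a few lines of C. The registry here is a toy linear table standing in for the proposed C-API function (which would presumably hash); `get_signature_id` and `call_site_matches` are invented names:

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for the proposed registry: assigns each distinct
   signature string a process-unique ID > 0. */
#define MAX_SIGS 64
static const char *sig_table[MAX_SIGS];
static int sig_count = 0;

static int get_signature_id(const char *sig)
{
    int i;
    for (i = 0; i < sig_count; i++)
        if (strcmp(sig_table[i], sig) == 0)
            return i + 1;       /* IDs start at 1; 0 means "unset" */
    sig_table[sig_count++] = sig;
    return sig_count;
}

/* A typed call site expecting "d)d": the static ID is resolved on
   first use, after which matching is a single integer comparison.
   (In the real scheme the callable would carry its own cached ID;
   here we re-derive it for simplicity.) */
static int call_site_matches(const char *callable_sig)
{
    static int expected_id = 0;        /* 0: not yet resolved */
    if (expected_id == 0)
        expected_id = get_signature_id("d)d");
    return get_signature_id(callable_sig) == expected_id;
}
```

This captures both branches of the description: case a) parses and caches on the first encounter, case b) is a plain ID comparison thereafter.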


>> Right... if we do some work to synchronize the types for Cython modules
>> generated by the same version of Cython, we're left with 3-4 types for
>> Cython, right? Then a couple for numba and one for f2py; so on the order of
>> 10?
> 
> No, I think each closure is its own type.

And that even applies to fused functions, right? They'd have one closure
for each type combination.


>> An alternative is do something funny in the type object to get across the
>> offset-in-object information (abusing the docstring, or introduce our own
>> flag which means that the type object has an additional non-standard field
>> at the end).
> 
> It's a hack, but the flag + non-standard field idea might just work...

Plus, it wouldn't have to stay a non-standard field. If it's accepted into
CPython 3.4, we could safely use it in all existing versions of CPython.

Stefan


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Dag Sverre Seljebotn

On 04/13/2012 01:38 PM, Stefan Behnel wrote:

Robert Bradshaw, 13.04.2012 12:17:

On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote:

On 04/13/2012 01:38 AM, Robert Bradshaw wrote:

Have you given any thought as to what happens if __call__ is
re-assigned for an object (or subclass of an object) supporting this
interface? Or is this out of scope?


Out-of-scope, I'd say. Though you can always write an object that detects if
you assign to __call__...


+1 for out of scope. This is a pure C level feature.



Minor nit: I don't think should_dereference is worth branching on, if
one wants to save the allocation one can still use a variable-sized
type and point to oneself. Yes, that's an extra dereference, but the
memory is already likely close and it greatly simplifies the logic.
But I could be wrong here.



Those minor nits are exactly what I seek; since Travis will have the first
implementation in numba<->SciPy, I just want to make sure that what he does
will work efficiently with Cython.


+1

I have to admit building/invoking these var-arg-sized __nativecall__
records seems painful. Here's another suggestion:

struct {
 void* pointer;
 size_t signature; // compressed binary representation, 95% coverage


Once you start passing around functions that take memory view slices as 
arguments, that 95% estimate will be off I think.



 char* long_signature; // used if signature is not representable in
a size_t, as indicated by signature = 0
} record;

These char* could optionally be allocated at the end of the record*
for optimal locality. We could even dispense with the binary
signature, but having that option allows us to avoid strcmp for stuff
like d)d and ffi)f.


Assuming we use literals and a const char* for the signature, the C
compiler would cut down the number of signature strings automatically for
us. And a pointer comparison is the same as a size_t comparison.


I'll go one further: Intern Python bytes objects. It's just a PyObject*, 
but it's *required* (or just strongly encouraged) to have gone through


sig = sys.modules['_nativecall']['interned_db'].setdefault(sig, sig)

Obviously in a PEP you'd have a C-API function for such interning 
(completely standalone utility). Performance of interning operation 
itself doesn't matter...


Unless CPython has interning features itself, like in Java? Was that 
present back in the day and then ripped out?


Requiring interning is somewhat less elegant in one way, but it makes a 
lot of other stuff much simpler.


That gives us

struct {
void *pointer;
PyBytesObject *signature;
} record;

and then you allocate a NULL-terminated arrays of these for all the 
overloads.
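A sketch of how dispatch would then work: once every signature has gone through the interning step, matching is a single pointer comparison per record, no strcmp. To keep this compilable without Python.h, `intern()` below is a toy stand-in for interned bytes objects; all names are illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for interning: equal strings map to one canonical pointer,
   so equality checks become pointer identity. */
#define MAX_INTERNED 64
static const char *interned[MAX_INTERNED];
static int n_interned = 0;

static const char *intern(const char *sig)
{
    int i;
    for (i = 0; i < n_interned; i++)
        if (strcmp(interned[i], sig) == 0)
            return interned[i];
    interned[n_interned++] = sig;
    return sig;
}

typedef struct {
    void *pointer;           /* NULL terminates the overload array */
    const char *signature;   /* must be interned */
} record_t;

/* One pointer compare per overload; no string comparison at all. */
static void *dispatch(const record_t *table, const char *interned_sig)
{
    for (; table->pointer != NULL; table++)
        if (table->signature == interned_sig)
            return table->pointer;
    return NULL;
}

static double triple(double x) { return 3.0 * x; }
```

The cost of interning is paid once at setup; the hot dispatch loop is then as cheap as the size_t comparison, which is the "makes a lot of other stuff much simpler" trade-off mentioned above.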




That would only apply at a per-module level, though, so it would require an
indirection for the signature IDs. But it would avoid a global registry.

Another idea would be to set the signature ID field to 0 at the beginning
and call a C-API function to let the current runtime assign an ID>  0,
unique for the currently running application. Then every user would only
have to parse the signature once to adapt to the respective ID and could
otherwise branch based on it directly.

For Cython, we could generate a static ID variable for each typed call that
we found in the sources. When encountering a C signature on a callable,
either a) the ID variable is still empty (initial case), then we parse the
signature to see if it matches the expected signature. If it does, we
assign the corresponding ID to the static ID variable and issue a direct
call. If b) the ID field is already set (normal case), we compare the
signature IDs directly and issue a C call if they match. If the IDs do not
match, we issue a normal Python call.



Right... if we do some work to synchronize the types for Cython modules
generated by the same version of Cython, we're left with 3-4 types for
Cython, right? Then a couple for numba and one for f2py; so on the order of
10?


No, I think each closure is its own type.


And that even applies to fused functions, right? They'd have one closure
for each type combination.



An alternative is to do something funny in the type object to get across the
offset-in-object information (abusing the docstring, or introduce our own
flag which means that the type object has an additional non-standard field
at the end).


It's a hack, but the flag + non-standard field idea might just work...


Plus, it wouldn't have to stay a non-standard field. If it's accepted into
CPython 3.4, we could safely use it in all existing versions of CPython.


Sounds good. Perhaps just find a single "extended" flag bit, then add a 
new flag field in our payload, in case we need to extend the type 
object yet again later and run out of unused flag bits (TBD: figure out 
how many unused flag bits there are).


Dag


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Nathaniel Smith
On Fri, Apr 13, 2012 at 9:52 AM, Dag Sverre Seljebotn
 wrote:
> On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
>> Also, I'm not sure the type registration will scale, especially if
>> every callable type wanted to get registered. (E.g. currently closures
>> and generators are new types...) Where to draw the line? (Perhaps
>> things could get registered lazily on the first __nativecall__ lookup,
>> as they're likely to be looked up again?)
>
>
> Right... if we do some work to synchronize the types for Cython modules
> generated by the same version of Cython, we're left with 3-4 types for
> Cython, right? Then a couple for numba and one for f2py; so on the order of
> 10?
>
> An alternative is to do something funny in the type object to get across the
> offset-in-object information (abusing the docstring, or introduce our own
> flag which means that the type object has an additional non-standard field
> at the end).

In Python 2.7, it looks like there may be a few TP_FLAG bits free --
15 and 16 are labeled "reserved for stackless python", and 2, 11, 22
don't have anything defined.

There may also be an unused ssize_t field ob_size at the beginning of
the type object. For some reason PyTypeObject is declared as variable
size (using PyObject_VAR_HEAD), but I don't see any variable-size
fields in it; the docs claim that the ob_size field is a "historical
artifact that is maintained for binary
compatibility...Always set this field to zero"; and Include/object.h
has a definition for a PyHeapTypeObject which has a PyTypeObject as
its first member, which would not work if PyTypeObject had variable
size. Grep says that the only place where ob_type->ob_size is accessed
is in Objects/typeobject.c:object_sizeof(), which at first glance
appears to be a bug, and anyway I don't think anyone cares whether
__sizeof__ on C-callable objects is exactly correct. One could use
this for an offset, or even a pointer.

One could also add a field easily by just subclassing PyTypeObject.
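The "spare flag bit plus trailing field" trick discussed in this thread can be mocked in plain C (a real implementation would use PyTypeObject and a genuinely unused Py_TPFLAGS bit; the names and the chosen bit below are invented):

```c
#include <assert.h>
#include <stddef.h>

/* Mock type object: tp_flags advertises whether a non-standard field
   has been appended after the standard struct. */
typedef struct {
    unsigned long tp_flags;
    /* ... the standard PyTypeObject fields would live here ... */
} FakeTypeObject;

#define TPFLAG_HAS_NATIVECALL (1UL << 22)   /* hypothetical unused bit */

typedef struct {
    FakeTypeObject base;
    size_t nativecall_offset;  /* offset of the table within instances */
} ExtendedTypeObject;

/* Safe accessor: only read the extra field when the flag says the
   type object actually has it. Returns 1 on success, 0 otherwise. */
static int get_nativecall_offset(const FakeTypeObject *tp, size_t *out)
{
    if (!(tp->tp_flags & TPFLAG_HAS_NATIVECALL))
        return 0;
    *out = ((const ExtendedTypeObject *)tp)->nativecall_offset;
    return 1;
}
```

Types that never set the flag are untouched, which is what makes the hack backwards compatible with existing type objects.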

The Signature thing seems like a distraction to me. Signature is
intended as just a nice convenient format for looking up stuff that's
otherwise stored in more obscure ways -- the API equivalent of
pretty-printing. The important thing here is getting the C-level
dispatch right.

-- Nathaniel


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Nathaniel Smith
On Fri, Apr 13, 2012 at 12:59 PM, Dag Sverre Seljebotn
 wrote:
> I'll go one further: Intern Python bytes objects. It's just a PyObject*, but
> it's *required* (or just strongly encouraged) to have gone through
>
> sig = sys.modules['_nativecall']['interned_db'].setdefault(sig, sig)
>
> Obviously in a PEP you'd have a C-API function for such interning
> (completely standalone utility). Performance of interning operation itself
> doesn't matter...
>
> Unless CPython has interning features itself, like in Java? Was that present
> back in the day and then ripped out?

http://docs.python.org/library/functions.html#intern ? (C API:
PyString_InternInPlace, moved from __builtin__.intern to sys.intern in
Py3.)

- N


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Stefan Behnel
Dag Sverre Seljebotn, 13.04.2012 13:59:
> On 04/13/2012 01:38 PM, Stefan Behnel wrote:
>> Robert Bradshaw, 13.04.2012 12:17:
>>> On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote:
 On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
> Minor nit: I don't think should_dereference is worth branching on, if
> one wants to save the allocation one can still use a variable-sized
> type and point to oneself. Yes, that's an extra dereference, but the
> memory is already likely close and it greatly simplifies the logic.
> But I could be wrong here.


 Those minor nits are exactly what I seek; since Travis will have the first
 implementation in numba<->SciPy, I just want to make sure that what he
 does will work efficiently with Cython.
>>>
>>> I have to admit building/invoking these var-arg-sized __nativecall__
>>> records seems painful. Here's another suggestion:
>>>
>>> struct {
>>>  void* pointer;
>>>  size_t signature; // compressed binary representation, 95% coverage
> 
> Once you start passing around functions that take memory view slices as
> arguments, that 95% estimate will be off I think.

Yes, I really think it makes sense to keeps IDs unique only over the
runtime of the application. (Note that using ssize_t instead of size_t
would allow setting the ID to -1 to disable signature matching, in case
that's ever needed.)


>>>  char* long_signature; // used if signature is not representable in
>>> a size_t, as indicated by signature = 0
>>> } record;
>>>
>>> These char* could optionally be allocated at the end of the record*
>>> for optimal locality. We could even dispense with the binary
>>> signature, but having that option allows us to avoid strcmp for stuff
>>> like d)d and ffi)f.
>>
>> Assuming we use literals and a const char* for the signature, the C
>> compiler would cut down the number of signature strings automatically for
>> us. And a pointer comparison is the same as a size_t comparison.
> 
> I'll go one further: Intern Python bytes objects. It's just a PyObject*,
> but it's *required* (or just strongly encouraged) to have gone through
> 
> sig = sys.modules['_nativecall']['interned_db'].setdefault(sig, sig)
> 
> Obviously in a PEP you'd have a C-API function for such interning
> (completely standalone utility). Performance of interning operation itself
> doesn't matter...
> 
> Unless CPython has interning features itself, like in Java? Was that
> present back in the day and then ripped out?

AFAIR, it always had to be done explicitly and is only available for
unicode objects in Py3 (and only for bytes objects in Py2). The CPython
parser also does it for identifiers, but it's not done automatically for
anything else. It's also not cheap to do - it would require a weakref dict
to accommodate the temporary allocation of large strings, and weak
references have a certain overhead.

In any case, this is an entirely different use case that should be handled
differently from normal string interning.


> Requiring interning is somewhat less elegant in one way, but it makes a lot
> of other stuff much simpler.
> 
> That gives us
> 
> struct {
> void *pointer;
> PyBytesObject *signature;
> } record;
> 
> and then you allocate a NULL-terminated arrays of these for all the overloads.

However, the problem is the setup. These references will have to be created
at init time and discarded during runtime termination. Not a problem for
Cython generated code, but some overhead for hand written code.

Since the size of these structs is not a problem, I'd prefer keeping Python
objects out of the game and using an ssize_t ID instead, inferred from a
char* signature at module init time by calling a C-API function. That
avoids the need for any cleanup.

Stefan


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread mark florisson
On 13 April 2012 12:38, Stefan Behnel  wrote:
> Robert Bradshaw, 13.04.2012 12:17:
>> On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote:
>>> On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
 Have you given any thought as to what happens if __call__ is
 re-assigned for an object (or subclass of an object) supporting this
 interface? Or is this out of scope?
>>>
>>> Out-of-scope, I'd say. Though you can always write an object that detects if
>>> you assign to __call__...
>
> +1 for out of scope. This is a pure C level feature.
>
>
 Minor nit: I don't think should_dereference is worth branching on, if
 one wants to save the allocation one can still use a variable-sized
 type and point to oneself. Yes, that's an extra dereference, but the
 memory is already likely close and it greatly simplifies the logic.
 But I could be wrong here.
>>>
>>>
>>> Those minor nits are exactly what I seek; since Travis will have the first
>>> implementation in numba<->SciPy, I just want to make sure that what he does
>>> will work efficiently with Cython.
>>
>> +1
>>
>> I have to admit building/invoking these var-arg-sized __nativecall__
>> records seems painful. Here's another suggestion:
>>
>> struct {
>>     void* pointer;
>>     size_t signature; // compressed binary representation, 95% coverage
>>     char* long_signature; // used if signature is not representable in
>> a size_t, as indicated by signature = 0
>> } record;
>>
>> These char* could optionally be allocated at the end of the record*
>> for optimal locality. We could even dispense with the binary
>> signature, but having that option allows us to avoid strcmp for stuff
>> like d)d and ffi)f.
>
> Assuming we use literals and a const char* for the signature, the C
> compiler would cut down the number of signature strings automatically for
> us. And a pointer comparison is the same as a size_t comparison.
>
> That would only apply at a per-module level, though, so it would require an
> indirection for the signature IDs. But it would avoid a global registry.
>
> Another idea would be to set the signature ID field to 0 at the beginning
> and call a C-API function to let the current runtime assign an ID > 0,
> unique for the currently running application. Then every user would only
> have to parse the signature once to adapt to the respective ID and could
> otherwise branch based on it directly.
>
> For Cython, we could generate a static ID variable for each typed call that
> we found in the sources. When encountering a C signature on a callable,
> either a) the ID variable is still empty (initial case), then we parse the
> signature to see if it matches the expected signature. If it does, we
> assign the corresponding ID to the static ID variable and issue a direct
> call. If b) the ID field is already set (normal case), we compare the
> signature IDs directly and issue a C call if they match. If the IDs do not
> match, we issue a normal Python call.
>
>
>>> Right... if we do some work to synchronize the types for Cython modules
>>> generated by the same version of Cython, we're left with 3-4 types for
>>> Cython, right? Then a couple for numba and one for f2py; so on the order of
>>> 10?
>>
>> No, I think each closure is its own type.
>
> And that even applies to fused functions, right? They'd have one closure
> for each type combination.
>

Hm, there is only one type for the function (CyFunction), but there is
a different type for the closure scope for each closure. The same goes
for FusedFunction, there is only one type, and each instance contains
a dict of specializations (mapping signatures to PyCFunctions).

(But each module still has different function types of course).

>>> An alternative is to do something funny in the type object to get across the
>>> offset-in-object information (abusing the docstring, or introduce our own
>>> flag which means that the type object has an additional non-standard field
>>> at the end).
>>
>> It's a hack, but the flag + non-standard field idea might just work...
>
> Plus, it wouldn't have to stay a non-standard field. If it's accepted into
> CPython 3.4, we could safely use it in all existing versions of CPython.
>
> Stefan


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread mark florisson
On 13 April 2012 12:59, Dag Sverre Seljebotn  wrote:
> On 04/13/2012 01:38 PM, Stefan Behnel wrote:
>>
>> Robert Bradshaw, 13.04.2012 12:17:
>>>
>>> On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote:

 On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
>
> Have you given any thought as to what happens if __call__ is
> re-assigned for an object (or subclass of an object) supporting this
> interface? Or is this out of scope?


 Out-of-scope, I'd say. Though you can always write an object that
 detects if
 you assign to __call__...
>>
>>
>> +1 for out of scope. This is a pure C level feature.
>>
>>
> Minor nit: I don't think should_dereference is worth branching on, if
> one wants to save the allocation one can still use a variable-sized
> type and point to oneself. Yes, that's an extra dereference, but the
> memory is already likely close and it greatly simplifies the logic.
> But I could be wrong here.



 Those minor nits are exactly what I seek; since Travis will have the
 first
 implementation in numba<->SciPy, I just want to make sure that what he
 does
 will work efficiently with Cython.
>>>
>>>
>>> +1
>>>
>>> I have to admit building/invoking these var-arg-sized __nativecall__
>>> records seems painful. Here's another suggestion:
>>>
>>> struct {
>>>     void* pointer;
>>>     size_t signature; // compressed binary representation, 95% coverage
>
>
> Once you start passing around functions that take memory view slices as
> arguments, that 95% estimate will be off I think.
>

It kind of depends on which argument types and how many arguments you
will allow, and whether or not collisions would be fine (which would
imply ID comparison + strcmp()).

>>>     char* long_signature; // used if signature is not representable in
>>> a size_t, as indicated by signature = 0
>>> } record;
>>>
>>> These char* could optionally be allocated at the end of the record*
>>> for optimal locality. We could even dispense with the binary
>>> signature, but having that option allows us to avoid strcmp for stuff
>>> like d)d and ffi)f.
>>
>>
>> Assuming we use literals and a const char* for the signature, the C
>> compiler would cut down the number of signature strings automatically for
>> us. And a pointer comparison is the same as a size_t comparison.
>
>
> I'll go one further: Intern Python bytes objects. It's just a PyObject*, but
> it's *required* (or just strongly encouraged) to have gone through
>
> sig = sys.modules['_nativecall'].interned_db.setdefault(sig, sig)
>
> Obviously in a PEP you'd have a C-API function for such interning
> (completely standalone utility). Performance of interning operation itself
> doesn't matter...
>
> Unless CPython has interning features itself, like in Java? Was that present
> back in the day and then ripped out?
>
> Requiring interning is somewhat less elegant in one way, but it makes a lot
> of other stuff much simpler.
>
> That gives us
>
> struct {
>    void *pointer;
>    PyBytesObject *signature;
> } record;
>
> and then you allocate a NULL-terminated array of these for all the
> overloads.
>

Interesting. What I like about size_t it that it could define a
deterministic ordering, which means specializations could be stored in
a binary search tree in array form. Cython would precompute the size_t
for the specialization it needs (and maybe account for promotions as
well).
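A sketch of that array-form layout (the size_t keys and the `which` payload are invented stand-ins for real specializations):

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t signature;  /* compressed binary key, precomputed by Cython */
    int    which;      /* stand-in for the specialization's entry point */
} spec_record;

/* Binary search over records sorted by ascending key; returns the
 * matching entry or -1 if no specialization has this signature. */
int find_spec(const spec_record *specs, size_t n, size_t key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (specs[mid].signature == key)
            return specs[mid].which;
        if (specs[mid].signature < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}

static int demo_find(void)
{
    static const spec_record specs[] = {  /* must be sorted by key */
        {10, 0}, {20, 1}, {30, 2}
    };
    return find_spec(specs, 3, 20) == 1 && find_spec(specs, 3, 25) == -1;
}
```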

>>
>> That would only apply at a per-module level, though, so it would require
>> an
>> indirection for the signature IDs. But it would avoid a global registry.
>>
>> Another idea would be to set the signature ID field to 0 at the beginning
>> and call a C-API function to let the current runtime assign an ID>  0,
>> unique for the currently running application. Then every user would only
>> have to parse the signature once to adapt to the respective ID and could
>> otherwise branch based on it directly.
>>
>> For Cython, we could generate a static ID variable for each typed call
>> that
>> we found in the sources. When encountering a C signature on a callable,
>> either a) the ID variable is still empty (initial case), then we parse the
>> signature to see if it matches the expected signature. If it does, we
>> assign the corresponding ID to the static ID variable and issue a direct
>> call. If b) the ID field is already set (normal case), we compare the
>> signature IDs directly and issue a C call it they match. If the IDs do not
>> match, we issue a normal Python call.
>>
>>
 Right... if we do some work to synchronize the types for Cython modules
 generated by the same version of Cython, we're left with 3-4 types for
 Cython, right? Then a couple for numba and one for f2py; so on the order
 of
 10?
>>>
>>>
>>> No, I think each closure is its own type.
>>
>>
>> And that even applies to fused functions, right? They'd have one closure
>> for each type combination.
>>
>>
 An alternative is to do something funny in the type object to get across
 the offset-in-object information (abusing the docstring, or introduce our
 own flag which means that the type object has an additional non-standard
 field at the end).

Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Stefan Behnel
Stefan Behnel, 13.04.2012 14:27:
> Dag Sverre Seljebotn, 13.04.2012 13:59:
>> Requiring interning is somewhat less elegant in one way, but it makes a lot
>> of other stuff much simpler.
>>
>> That gives us
>>
>> struct {
>> void *pointer;
>> PyBytesObject *signature;
>> } record;
>>
>> and then you allocate a NULL-terminated array of these for all the 
>> overloads.
> 
> However, the problem is the setup. These references will have to be created
> at init time and discarded during runtime termination. Not a problem for
> Cython generated code, but some overhead for hand written code.
> 
> Since the size of these structs is not a problem, I'd prefer keeping Python
> objects out of the game and using an ssize_t ID instead, inferred from a
> char* signature at module init time by calling a C-API function. That
> avoids the need for any cleanup.

Actually, we could even use interned char* values. Nothing keeps that C-API
setup function from reassigning the "char* signature" field to the char*
buffer of an internally allocated byte string. Except that we'd have to
*require* users to use literals or otherwise statically allocated C strings
in that field. Hmm, maybe not the best idea ever...

Stefan


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread mark florisson
On 13 April 2012 13:48, Stefan Behnel  wrote:
> Stefan Behnel, 13.04.2012 14:27:
>> Dag Sverre Seljebotn, 13.04.2012 13:59:
>>> Requiring interning is somewhat less elegant in one way, but it makes a lot
>>> of other stuff much simpler.
>>>
>>> That gives us
>>>
>>> struct {
>>>     void *pointer;
>>>     PyBytesObject *signature;
>>> } record;
>>>
>>> and then you allocate a NULL-terminated array of these for all the 
>>> overloads.
>>
>> However, the problem is the setup. These references will have to be created
>> at init time and discarded during runtime termination. Not a problem for
>> Cython generated code, but some overhead for hand written code.
>>
>> Since the size of these structs is not a problem, I'd prefer keeping Python
>> objects out of the game and using an ssize_t ID instead, inferred from a
>> char* signature at module init time by calling a C-API function. That
>> avoids the need for any cleanup.
>
> Actually, we could even use interned char* values. Nothing keeps that C-API
> setup function from reassigning the "char* signature" field to the char*
> buffer of an internally allocated byte string. Except that we'd have to
> *require* users to use literals or otherwise statically allocated C strings
> in that field. Hmm, maybe not the best idea ever...
>
> Stefan

You could create a module shared by all versions and projects, which
exposes a function 'get_signature', which given a char *signature
returns the pointer that should be used in the ABI signature type
information. You can then always compare by identity.
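A minimal sketch of that shared `get_signature` in C (hypothetical; the real one would live in the shared module and handle locking and allocation failure):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INTERN_MAX 256

static char *interned[INTERN_MAX];
static int n_interned = 0;

/* Return a canonical pointer for `sig`: equal strings always yield the
 * same pointer, so later signature checks are pointer comparisons. */
const char *get_signature(const char *sig)
{
    int i;
    size_t len;
    for (i = 0; i < n_interned; i++)
        if (strcmp(interned[i], sig) == 0)
            return interned[i];         /* already interned */
    len = strlen(sig) + 1;
    interned[n_interned] = malloc(len); /* real code checks for NULL */
    memcpy(interned[n_interned], sig, len);
    return interned[n_interned++];
}
```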


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Dag Sverre Seljebotn

On 04/13/2012 03:01 PM, mark florisson wrote:

On 13 April 2012 13:48, Stefan Behnel  wrote:

Stefan Behnel, 13.04.2012 14:27:

Dag Sverre Seljebotn, 13.04.2012 13:59:

Requiring interning is somewhat less elegant in one way, but it makes a lot
of other stuff much simpler.

That gives us

struct {
 void *pointer;
 PyBytesObject *signature;
} record;

and then you allocate a NULL-terminated array of these for all the overloads.


However, the problem is the setup. These references will have to be created
at init time and discarded during runtime termination. Not a problem for
Cython generated code, but some overhead for hand written code.

Since the size of these structs is not a problem, I'd prefer keeping Python
objects out of the game and using an ssize_t ID instead, inferred from a
char* signature at module init time by calling a C-API function. That
avoids the need for any cleanup.


Actually, we could even use interned char* values. Nothing keeps that C-API
setup function from reassigning the "char* signature" field to the char*
buffer of an internally allocated byte string. Except that we'd have to
*require* users to use literals or otherwise statically allocated C strings
in that field. Hmm, maybe not the best idea ever...

Stefan


You could create a module shared by all versions and projects, which
exposes a function 'get_signature', which given a char *signature
returns the pointer that should be used in the ABI signature type
information. You can then always compare by identity.


I fail to see how this is different from what I proposed, with interning 
bytes objects (which I still prefer; although the binary-search features 
of direct comparison make that attractive too).


BTW, any proposal that requires an actual project/library that both 
Cython and NumPy depend on will fail in the real world.


Dag


Re: [Cython] pyregr test suite

2012-04-13 Thread Stefan Behnel
Stefan Behnel, 13.04.2012 07:11:
> Robert Bradshaw, 12.04.2012 22:21:
>> On Thu, Apr 12, 2012 at 11:21 AM, mark florisson wrote:
>>> Could we run the pyregr test suite manually instead of automatically?
>>> It takes a lot of resources to build, and a single simple push to the
>>> cython-devel branch results in the build slots being hogged for hours,
>>> making the continuous development a lot less 'continuous'. We could
>>> just decide to run the pyregr suite every so often, or whenever we
>>> make an addition or change that could actually affect Python code (if
>>> one updates a test then there is no use in running pyregr for
>>> instance).
>>
>> +1 to manual + periodic for these tests. Alternatively we could make
>> them depend on each other, so at most one core is consumed.
> 
> Ok, I'll set it up.

They are now triggered by the (nightly) CPython builds and the four
configurations run sequentially (there's an option for that), starting with
the C tests.

I would recommend configuring your own pyregr test jobs (if you have any)
for manual runs by disabling all of their triggers.

Stefan


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread mark florisson
On 13 April 2012 14:27, Dag Sverre Seljebotn  wrote:
> On 04/13/2012 03:01 PM, mark florisson wrote:
>>
>> On 13 April 2012 13:48, Stefan Behnel  wrote:
>>>
>>> Stefan Behnel, 13.04.2012 14:27:

 Dag Sverre Seljebotn, 13.04.2012 13:59:
>
> Requiring interning is somewhat less elegant in one way, but it makes a
> lot
> of other stuff much simpler.
>
> That gives us
>
> struct {
>     void *pointer;
>     PyBytesObject *signature;
> } record;
>
> and then you allocate a NULL-terminated array of these for all the
> overloads.


 However, the problem is the setup. These references will have to be
 created
 at init time and discarded during runtime termination. Not a problem for
 Cython generated code, but some overhead for hand written code.

 Since the size of these structs is not a problem, I'd prefer keeping
 Python
 objects out of the game and using an ssize_t ID instead, inferred from a
 char* signature at module init time by calling a C-API function. That
 avoids the need for any cleanup.
>>>
>>>
>>> Actually, we could even use interned char* values. Nothing keeps that
>>> C-API
>>> setup function from reassigning the "char* signature" field to the char*
>>> buffer of an internally allocated byte string. Except that we'd have to
>>> *require* users to use literals or otherwise statically allocated C
>>> strings
>>> in that field. Hmm, maybe not the best idea ever...
>>>
>>> Stefan
>>
>>
>> You could create a module shared by all versions and projects, which
>> exposes a function 'get_signature', which given a char *signature
>> returns the pointer that should be used in the ABI signature type
>> information. You can then always compare by identity.
>
>
> I fail to see how this is different from what I proposed, with interning
> bytes objects (which I still prefer; although the binary-search features of
> direct comparison make that attractive too).

It's not really different, more a response to Stefan's comment.

> BTW, any proposal that requires an actual project/library that both Cython
> and NumPy depend on will fail in the real world.

That's fine as long as they use the same way to expose ABI
information. As a courtesy though, we could do it anyway, which makes
it easier for those respective projects to understand what's involved,
how to implement it, and they can then decide whether they want to
ship that project as part of their own project.

> Dag
>


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Robert Bradshaw
On Fri, Apr 13, 2012 at 4:59 AM, Dag Sverre Seljebotn
 wrote:
> On 04/13/2012 01:38 PM, Stefan Behnel wrote:
>>
>> Robert Bradshaw, 13.04.2012 12:17:
>>>
>>> On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote:

 On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
>
> Have you given any thought as to what happens if __call__ is
> re-assigned for an object (or subclass of an object) supporting this
> interface? Or is this out of scope?


 Out-of-scope, I'd say. Though you can always write an object that
 detects if
 you assign to __call__...
>>
>>
>> +1 for out of scope. This is a pure C level feature.
>>
>>
> Minor nit: I don't think should_dereference is worth branching on, if
> one wants to save the allocation one can still use a variable-sized
> type and point to oneself. Yes, that's an extra dereference, but the
> memory is already likely close and it greatly simplifies the logic.
> But I could be wrong here.



 Those minor nits are exactly what I seek; since Travis will have the
 first
 implementation in numba<->SciPy, I just want to make sure that what he
 does
 will work efficiently with Cython.
>>>
>>>
>>> +1
>>>
>>> I have to admit building/invoking these var-arg-sized __nativecall__
>>> records seems painful. Here's another suggestion:
>>>
>>> struct {
>>>     void* pointer;
>>>     size_t signature; // compressed binary representation, 95% coverage
>
> Once you start passing around functions that take memory view slices as
> arguments, that 95% estimate will be off I think.

We have (on the high-performance systems we care about) 64 bits here.
If we limit ourselves to a 6-bit alphabet, that gives a trivial
encoding for up to 10 chars. We could be more clever here (Huffman
coding) but that might be overkill. More importantly, though, calls
with "complicated" signatures are likely to be expensive enough that
the strcmp overhead won't matter.
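For illustration, such a packing could look like the sketch below; the 6-bit alphabet is invented here, and a real CEP would fix the exact code assignment:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack a signature of up to 10 characters from a <64-symbol alphabet
 * into a 64-bit word, 6 bits per character; return 0 when the string
 * does not fit, so callers fall back to the char* representation. */
uint64_t encode_signature(const char *sig)
{
    static const char alphabet[] = "dfi)lschqv";  /* invented alphabet */
    uint64_t packed = 0;
    size_t i, len = strlen(sig);
    if (len == 0 || len > 10)
        return 0;                 /* 10 chars * 6 bits = 60 bits <= 64 */
    for (i = 0; i < len; i++) {
        const char *p = strchr(alphabet, sig[i]);
        if (p == NULL)
            return 0;             /* character outside the alphabet */
        packed = (packed << 6) | (uint64_t)(p - alphabet + 1);
    }
    return packed;
}
```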

>>>     char* long_signature; // used if signature is not representable in
>>> a size_t, as indicated by signature = 0
>>> } record;
>>>
>>> These char* could optionally be allocated at the end of the record*
>>> for optimal locality. We could even dispense with the binary
>>> signature, but having that option allows us to avoid strcmp for stuff
>>> like d)d and ffi)f.
>>
>>
>> Assuming we use literals and a const char* for the signature, the C
>> compiler would cut down the number of signature strings automatically for
>> us. And a pointer comparison is the same as a size_t comparison.
>
>
> I'll go one further: Intern Python bytes objects. It's just a PyObject*, but
> it's *required* (or just strongly encouraged) to have gone through
>
> sig = sys.modules['_nativecall'].interned_db.setdefault(sig, sig)
>
> Obviously in a PEP you'd have a C-API function for such interning
> (completely standalone utility). Performance of interning operation itself
> doesn't matter...
>
> Unless CPython has interning features itself, like in Java? Was that present
> back in the day and then ripped out?
>
> Requiring interning is somewhat less elegant in one way, but it makes a lot
> of other stuff much simpler.
>
> That gives us
>
> struct {
>    void *pointer;
>    PyBytesObject *signature;
> } record;
>
> and then you allocate a NULL-terminated array of these for all the
> overloads.

Global interning is a nice idea. The one drawback I see is that it
becomes much more expensive for dynamically calculated signatures.

>>
>> That would only apply at a per-module level, though, so it would require
>> an
>> indirection for the signature IDs. But it would avoid a global registry.
>>
>> Another idea would be to set the signature ID field to 0 at the beginning
>> and call a C-API function to let the current runtime assign an ID>  0,
>> unique for the currently running application. Then every user would only
>> have to parse the signature once to adapt to the respective ID and could
>> otherwise branch based on it directly.
>>
>> For Cython, we could generate a static ID variable for each typed call
>> that
>> we found in the sources. When encountering a C signature on a callable,
>> either a) the ID variable is still empty (initial case), then we parse the
>> signature to see if it matches the expected signature. If it does, we
>> assign the corresponding ID to the static ID variable and issue a direct
>> call. If b) the ID field is already set (normal case), we compare the
>> signature IDs directly and issue a C call if they match. If the IDs do not
>> match, we issue a normal Python call.

If I understand correctly, you're proposing

typedef struct {
  char* sig;
  long id;
} sig_t;

Where comparison would (sometimes?) compute id from sig by augmenting
a global counter and dict? Might be expensive to bootstrap, but
eventually all relevant ids would be filled in and it would be quick.
Interesting. I wonder what the performance penalty would be over
assuming id is statically computed lots of the time, and using that to
compare against fixed values. And there's memory locality issues as well.
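A sketch of that lazy-assignment scheme (invented helper names; the registry would really be a locked hash table shared process-wide):

```c
#include <assert.h>
#include <string.h>

typedef struct {
    const char *sig;  /* signature string, e.g. a literal like "dd)d" */
    long id;          /* 0 until assigned from the global counter */
} sig_t;

#define MAX_SIGS 64
static const char *known[MAX_SIGS];
static long n_known = 0;

static long lookup_or_assign(const char *sig)  /* slow, strcmp-based */
{
    long i;
    for (i = 0; i < n_known; i++)
        if (strcmp(known[i], sig) == 0)
            return i + 1;
    known[n_known++] = sig;   /* a real registry would copy and lock */
    return n_known;
}

static long sig_id(sig_t *s)
{
    if (s->id == 0)                 /* bootstrap path, taken only once */
        s->id = lookup_or_assign(s->sig);
    return s->id;
}

int sigs_equal(sig_t *a, sig_t *b)  /* steady state: integers only */
{
    return sig_id(a) == sig_id(b);
}

static int demo(void)
{
    sig_t a = {"dd)d", 0}, b = {"dd)d", 0}, c = {"ffi)f", 0};
    return sigs_equal(&a, &b) && !sigs_equal(&a, &c);
}
```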

Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Robert Bradshaw
On Fri, Apr 13, 2012 at 5:48 AM, mark florisson
 wrote:
> On 13 April 2012 12:59, Dag Sverre Seljebotn  
> wrote:
>> On 04/13/2012 01:38 PM, Stefan Behnel wrote:
>>>
>>> Robert Bradshaw, 13.04.2012 12:17:

 On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote:
>
> On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
>>
>> Have you given any thought as to what happens if __call__ is
>> re-assigned for an object (or subclass of an object) supporting this
>> interface? Or is this out of scope?
>
>
> Out-of-scope, I'd say. Though you can always write an object that
> detects if
> you assign to __call__...
>>>
>>>
>>> +1 for out of scope. This is a pure C level feature.
>>>
>>>
>> Minor nit: I don't think should_dereference is worth branching on, if
>> one wants to save the allocation one can still use a variable-sized
>> type and point to oneself. Yes, that's an extra dereference, but the
>> memory is already likely close and it greatly simplifies the logic.
>> But I could be wrong here.
>
>
>
> Those minor nits are exactly what I seek; since Travis will have the
> first
> implementation in numba<->SciPy, I just want to make sure that what he
> does
> will work efficiently with Cython.


 +1

 I have to admit building/invoking these var-arg-sized __nativecall__
 records seems painful. Here's another suggestion:

 struct {
     void* pointer;
     size_t signature; // compressed binary representation, 95% coverage
>>
>>
>> Once you start passing around functions that take memory view slices as
>> arguments, that 95% estimate will be off I think.
>>
>
> It kind of depends on which argument types and how many arguments you
> will allow, and whether or not collisions would be fine (which would
> imply ID comparison + strcmp()).

Interesting idea, though this has the drawback of doubling (at least)
the overhead of the simple (important) case as well as memory
requirements/locality issues.

     char* long_signature; // used if signature is not representable in
 a size_t, as indicated by signature = 0
 } record;

 These char* could optionally be allocated at the end of the record*
 for optimal locality. We could even dispense with the binary
 signature, but having that option allows us to avoid strcmp for stuff
 like d)d and ffi)f.
>>>
>>>
>>> Assuming we use literals and a const char* for the signature, the C
>>> compiler would cut down the number of signature strings automatically for
>>> us. And a pointer comparison is the same as a size_t comparison.
>>
>>
>> I'll go one further: Intern Python bytes objects. It's just a PyObject*, but
>> it's *required* (or just strongly encouraged) to have gone through
>>
>> sig = sys.modules['_nativecall'].interned_db.setdefault(sig, sig)
>>
>> Obviously in a PEP you'd have a C-API function for such interning
>> (completely standalone utility). Performance of interning operation itself
>> doesn't matter...
>>
>> Unless CPython has interning features itself, like in Java? Was that present
>> back in the day and then ripped out?
>>
>> Requiring interning is somewhat less elegant in one way, but it makes a lot
>> of other stuff much simpler.
>>
>> That gives us
>>
>> struct {
>>    void *pointer;
>>    PyBytesObject *signature;
>> } record;
>>
>> and then you allocate a NULL-terminated array of these for all the
>> overloads.
>>
>
> Interesting. What I like about size_t it that it could define a
> deterministic ordering, which means specializations could be stored in
> a binary search tree in array form.

I think the number of specializations would have to be quite large
(>10, maybe 100) before a binary search wins out over a simple scan,
but if we stored a count rather than a NULL-terminated array, the
lookup function could take this into account. (The header will already
have plenty of room if we're storing a version number and want the
records to be properly aligned.)

Requiring them to be sorted would also allow us to abort on average
half way through a scan. Of course prioritizing the "likely"
signatures first may be more of a win.
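A sketch of that sorted scan with early abort (counted array rather than NULL-terminated; keys invented):

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t signature;
    int    which;   /* stand-in for the specialization's entry point */
} spec_record;

/* Linear scan over records sorted by key; abort as soon as we pass the
 * position where `key` would have to be. */
int scan_spec(const spec_record *specs, size_t n, size_t key)
{
    size_t i;
    for (i = 0; i < n && specs[i].signature <= key; i++)
        if (specs[i].signature == key)
            return specs[i].which;
    return -1;      /* remaining keys are all larger: not present */
}

static int demo_scan(void)
{
    static const spec_record specs[] = {  /* sorted by key */
        {10, 0}, {20, 1}, {30, 2}
    };
    return scan_spec(specs, 3, 20) == 1 && scan_spec(specs, 3, 15) == -1;
}
```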

> Cython would precompute the size_t
> for the specialization it needs (and maybe account for promotions as
> well).

Exactly.

>>> That would only apply at a per-module level, though, so it would require
>>> an
>>> indirection for the signature IDs. But it would avoid a global registry.
>>>
>>> Another idea would be to set the signature ID field to 0 at the beginning
>>> and call a C-API function to let the current runtime assign an ID>  0,
>>> unique for the currently running application. Then every user would only
>>> have to parse the signature once to adapt to the respective ID and could
>>> otherwise branch based on it directly.
>>>
>>> For Cython, we could generate a static ID variable for each typed call
>>> that
>>> we found in the sources. When encountering a C signature on a callable, ...

Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Robert Bradshaw
On Fri, Apr 13, 2012 at 10:26 AM, Robert Bradshaw  wrote:
> On Fri, Apr 13, 2012 at 4:59 AM, Dag Sverre Seljebotn
>  wrote:
>> On 04/13/2012 01:38 PM, Stefan Behnel wrote:

>>> That would only apply at a per-module level, though, so it would require
>>> an
>>> indirection for the signature IDs. But it would avoid a global registry.
>>>
>>> Another idea would be to set the signature ID field to 0 at the beginning
>>> and call a C-API function to let the current runtime assign an ID>  0,
>>> unique for the currently running application. Then every user would only
>>> have to parse the signature once to adapt to the respective ID and could
>>> otherwise branch based on it directly.
>>>
>>> For Cython, we could generate a static ID variable for each typed call
>>> that
>>> we found in the sources. When encountering a C signature on a callable,
>>> either a) the ID variable is still empty (initial case), then we parse the
>>> signature to see if it matches the expected signature. If it does, we
>>> assign the corresponding ID to the static ID variable and issue a direct
>>> call. If b) the ID field is already set (normal case), we compare the
>>> signature IDs directly and issue a C call if they match. If the IDs do not
>>> match, we issue a normal Python call.
>
> If I understand correctly, you're proposing
>
> typedef struct {
>  char* sig;
>  long id;
> } sig_t;
>
> Where comparison would (sometimes?) compute id from sig by augmenting
> a global counter and dict? Might be expensive to bootstrap, but
> eventually all relevant ids would be filled in and it would be quick.
> Interesting. I wonder what the performance penalty would be over
> assuming id is statically computed lots of the time, and using that to
> compare against fixed values. And there's memory locality issues as
> well.

To clarify, I'd really like to have the following as fast as possible:

if (callable.sig.id == X) {
   // yep, that's what I thought
} else {
   // generic call
}

Alternatively, one can imagine wanting to do:

switch (callable.sig.id) {
case X:
// I can do this
case Y:
// this is common and fast as well
...
default:
// generic call
}

There is some question about how promotion should work (e.g. should
this flexibility reside in the caller or the callee (or both, though
that could result in a quadratic number of comparisons)?)
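For illustration, caller-side promotion might look like the sketch below, where the ids are invented placeholders for the signatures the caller was compiled to handle:

```c
#include <assert.h>

/* Hypothetical precomputed ids: the exact signature, one the caller can
 * reach by promoting its float arguments to double, and anything else. */
enum { SIG_FF_F = 1, SIG_DD_D = 2, SIG_OTHER = 99 };

/* Decide the call strategy from the callee's signature id. */
int choose_strategy(long callee_id)
{
    if (callee_id == SIG_FF_F)
        return 0;   /* exact match: direct C call */
    if (callee_id == SIG_DD_D)
        return 1;   /* promote f -> d, then direct C call */
    return 2;       /* fall back to a generic Python call */
}
```

Putting the table in the caller keeps the callee's record list short, at the cost of each call site knowing its own promotions.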

- Robert


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Stefan Behnel
Stefan Behnel, 13.04.2012 07:24:
> Dag Sverre Seljebotn, 13.04.2012 00:34:
>> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
>> http://wiki.cython.org/enhancements/cep1000
> 
> I'm all for doing something in this direction and have been hinting at it
> on the PyPy mailing list for a while, without reaction so far. I'll trigger
> them again, with a pointer to this discussion and the CEP. PyPy should be
> totally interested in a generic way to do fast calls into wrapped C code in
> general and Cython implemented functions specifically. Their JIT would then
> look at the function at runtime and unwrap it.

I just learned that the support in PyPy would be rather straightforward.
It already supports calling native code with a known signature through
their "rlib/libffi.py" module, so all that remains to be done on their side
is mapping the encoded signature to their own signature configuration.

Stefan


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Robert Bradshaw
On Fri, Apr 13, 2012 at 12:15 PM, Stefan Behnel  wrote:
> Stefan Behnel, 13.04.2012 07:24:
>> Dag Sverre Seljebotn, 13.04.2012 00:34:
>>> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
>>> http://wiki.cython.org/enhancements/cep1000
>>
>> I'm all for doing something in this direction and have been hinting at it
>> on the PyPy mailing list for a while, without reaction so far. I'll trigger
>> them again, with a pointer to this discussion and the CEP. PyPy should be
>> totally interested in a generic way to do fast calls into wrapped C code in
>> general and Cython implemented functions specifically. Their JIT would then
>> look at the function at runtime and unwrap it.
>
> I just learned that the support in PyPy would be rather straightforward.
> It already supports calling native code with a known signature through
> their "rlib/libffi.py" module,

Cool.

> so all that remains to be done on their side
> is mapping the encoded signature to their own signature configuration.

Or looking into borrowing theirs? (We might want more extensibility,
e.g. declaring buffer types and nogil/exception data. I assume ctypes
has a signature declaration format as well, right?)

- Robert


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Stefan Behnel
Robert Bradshaw, 13.04.2012 21:26:
> On Fri, Apr 13, 2012 at 12:15 PM, Stefan Behnel wrote:
>> Stefan Behnel, 13.04.2012 07:24:
>>> Dag Sverre Seljebotn, 13.04.2012 00:34:
 On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
 http://wiki.cython.org/enhancements/cep1000
>>>
>>> I'm all for doing something in this direction and have been hinting at it
>>> on the PyPy mailing list for a while, without reaction so far. I'll trigger
>>> them again, with a pointer to this discussion and the CEP. PyPy should be
>>> totally interested in a generic way to do fast calls into wrapped C code in
>>> general and Cython implemented functions specifically. Their JIT would then
>>> look at the function at runtime and unwrap it.
>>
>> I just learned that the support in PyPy would be rather straightforward.
>> It already supports calling native code with a known signature through
>> their "rlib/libffi.py" module,
> 
> Cool.
> 
>> so all that remains to be done on their side
>> is mapping the encoded signature to their own signature configuration.
> 
> Or looking into borrowing theirs? (We might want more extensibility,
> e.g. declaring buffer types and nogil/exception data. I assume ctypes
> has a signature declaration format as well, right?)

PyPy's ctypes implementation is based on libffi. However, I think neither
of the two has a declaration format (e.g. string based) other than the
object based declaration notation. You basically pass them a sequence of
type objects to declare the signature. That's not really easy to map to the
C level - at least not efficiently...

Stefan


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Stefan Behnel
Robert Bradshaw, 13.04.2012 20:21:
> On Fri, Apr 13, 2012 at 10:26 AM, Robert Bradshaw wrote:
>> On Fri, Apr 13, 2012 at 4:59 AM, Dag Sverre Seljebotn wrote:
>>> On 04/13/2012 01:38 PM, Stefan Behnel wrote:
 That would only apply at a per-module level, though, so it would
 require an indirection for the signature IDs. But it would avoid a
 global registry.

 Another idea would be to set the signature ID field to 0 at the beginning
 and call a C-API function to let the current runtime assign an ID>  0,
 unique for the currently running application. Then every user would only
 have to parse the signature once to adapt to the respective ID and could
 otherwise branch based on it directly.

 For Cython, we could generate a static ID variable for each typed call
 that
 we found in the sources. When encountering a C signature on a callable,
 either a) the ID variable is still empty (initial case), then we parse the
 signature to see if it matches the expected signature. If it does, we
 assign the corresponding ID to the static ID variable and issue a direct
 call. If b) the ID field is already set (normal case), we compare the
 signature IDs directly and issue a C call if they match. If the IDs do not
 match, we issue a normal Python call.
>>
>> If I understand correctly, you're proposing
>>
>> struct {
>>  char* sig;
>>  long id;
>> } sig_t;
>>
>> Where comparison would (sometimes?) compute id from sig by augmenting
>> a global counter and dict? Might be expensive to bootstrap, but
>> eventually all relevant ids would be filled in and it would be quick.

Yes. If a function is only called once, the overhead won't matter. And
starting from the second call, it would either be fast if the function
signature matches or slow anyway if it doesn't match.
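The lazy-ID scheme described above can be sketched in C roughly as follows. The registry and all names here are made up for illustration; a real implementation would need locking and a proper hash table shared across modules.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical runtime registry mapping signature strings to small
 * integer IDs, assigned on first sight.  A real implementation would
 * need locking and a hash table; the linear scan is only a sketch. */
#define MAX_SIGS 64
static const char *sig_table[MAX_SIGS];
static long next_id = 1;               /* 0 means "not yet assigned" */

static long intern_sig_id(const char *sig)
{
    long i;
    for (i = 1; i < next_id; i++)
        if (strcmp(sig_table[i], sig) == 0)
            return i;
    sig_table[next_id] = sig;          /* no bounds check in this sketch */
    return next_id++;
}

typedef struct {
    const char *sig;   /* e.g. "ii)i" */
    long id;           /* lazily assigned, 0 initially */
} sig_t;

/* Caller-side check: on first use (case a) intern the signature and
 * cache the ID; afterwards (case b) compare the cached IDs directly. */
static int sig_matches(sig_t *callee, sig_t *expected)
{
    if (callee->id == 0)
        callee->id = intern_sig_id(callee->sig);
    if (expected->id == 0)
        expected->id = intern_sig_id(expected->sig);
    return callee->id == expected->id;
}
```

After bootstrap, the happy path is a single integer comparison per call site.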


>> Interesting. I wonder what the performance penalty would be over
>> assuming id is statically computed lots of the time, and using that to
>> compare against fixed values. And there's memory locality issues as
>> well.
> 
> To clarify, I'd really like to have the following as fast as possible:
> 
> if (callable.sig.id == X) {
>// yep, that's what I thought
> } else {
>// generic call
> }
> 
> Alternatively, one can imagine wanting to do:
> 
> switch (callable.sig.id) {
> case X:
> // I can do this
> case Y:
> // this is common and fast as well
> ...
> default:
> // generic call
> }

Yes, that's the idea.


> There is some question about how promotion should work (e.g. should
> this flexibility reside in the caller or the callee (or both, though
> that could result in a quadratic number of comparisons)?)

Callees could expose multiple signatures (which would result in a direct
call for each, without further comparisons), then the caller would have to
choose between those. However, if none matches exactly, the caller might
want to promote its arguments and try more signatures. In any case, it's
the caller that does the work, never the callee.

We could generate code like this:

/* cdef int x = ...
 * cdef long y = ...
 * cdef int z   # interesting: what if z is not typed?
 * z = func(x, y)
 */

if (func.sig.id == id("[int,long] -> int")) {
 z = ((cast)func.cfunc) (x,y);
} else if (sizeof(long) > sizeof(int) &&
   (func.sig.id == id("[long,long] -> int"))) {
 z = ((cast)func.cfunc) ((long)x, y);
} etc. ... else {
 /* pack and call as Python function */
}

Meaning, the C compiler could reduce the amount of optimistic call code at
compile time.

Stefan


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Dag Sverre Seljebotn
Ah, I didn't think about 6-bit or Huffman. Certainly helps.

I'm almost +1 on your proposal now, but a couple of more ideas:

1) Let the key (the size_t) spill over to the next specialization entry if it 
is too large; and prepend that key with a continuation code (two size-ts could 
together say "iii)-d\0\0" on 32 bit systems with 8bit encoding, using - as 
continuation). The key-based caller will expect a continuation if it knows 
about the specialization, and the prepended char will prevent spurious matches 
against the overspilled slot.

We could even use the pointers for part of the continuation...

2) Separate the char* format strings from the keys, ie this memory layout:

Version,nslots,nspecs,funcptr,key,funcptr,key,...,sigcharptr,sigcharptr...

Where nslots is larger than nspecs if there are continuations.

OK, this is getting close to my original proposal, but the difference is the 
continuation char, so that if you expect a short signature, you can safely scan 
every slot with no branching and no null-checking necessary.

Dag
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Robert Bradshaw  wrote:

On Fri, Apr 13, 2012 at 4:59 AM, Dag Sverre Seljebotn
 wrote:
> On 04/13/2012 01:38 PM, Stefan Behnel wrote:
>>
>> Robert Bradshaw, 13.04.2012 12:17:
>>>
>>> On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote:

 On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
>
> Have you given any thought as to what happens if __call__ is
> re-assigned for an object (or subclass of an object) supporting this
> interface? Or is this out of scope?


 Out-of-scope, I'd say. Though you can always write an object that
 detects if
 you assign to __call__...
>>
>>
>> +1 for out of scope. This is a pure C level feature.
>>
>>
> Minor nit: I don't think should_dereference is worth branching on, if
> one wants to save the allocation one can still use a variable-sized
> type and point to oneself. Yes, that's an extra dereference, but the
> memory is already likely close and it greatly simplifies the logic.
> But I could be wrong here.



 Those minor nits are exactly what I seek; since Travis will have the
 first
 implementation in numba<->SciPy, I just want to make sure that what he
 does
 will work efficiently with Cython.
>>>
>>>
>>> +1
>>>
>>> I have to admit building/invoking these var-arg-sized __nativecall__
>>> records seems painful. Here's another suggestion:
>>>
>>> struct {
>>> void* pointer;
>>> size_t signature; // compressed binary representation, 95% coverage
>
> Once you start passing around functions that take memory view slices as
> arguments, that 95% estimate will be off I think.

We have (on the high-performance systems we care about) 64-bits here.
If we limit ourselves to a 6-bit alphabet, that gives a trivial
encoding for up to 10 chars. We could be more clever here (Huffman
coding) but that might be overkill. More importantly though, the
"complicated" signatures are likely to be expensive enough that the strcmp
overhead hardly matters.

>>> char* long_signature; // used if signature is not representable in
>>> a size_t, as indicated by signature = 0
>>> } record;
>>>
>>> These char* could optionally be allocated at the end of the record*
>>> for optimal locality. We could even dispense with the binary
>>> signature, but having that option allows us to avoid strcmp for stuff
>>> like d)d and ffi)f.
>>
>>
>> Assuming we use literals and a const char* for the signature, the C
>> compiler would cut down the number of signature strings automatically for
>> us. And a pointer comparison is the same as a size_t comparison.
>
>
> I'll go one further: Intern Python bytes objects. It's just a PyObject*, but
> it's *required* (or just strongly encouraged) to have gone through
>
> sig = sys.modules['_nativecall']['interned_db'].setdefault(sig, sig)
>
> Obviously in a PEP you'd have a C-API function for such interning
> (completely standalone utility). Performance of interning operation itself
> doesn't matter...
>
> Unless CPython has interning features itself, like in Java? Was that present
> back in the day and then ripped out?
>
> Requiring interning is somewhat less elegant in one way, but it makes a lot
> of other stuff much simpler.
>
> That gives us
>
> struct {
>void *pointer;
>PyBytesObject *signature;
> } record;
>
> and then you allocate a NULL-terminated arrays of these for all the
> overloads.

Global interning is a nice idea. The one drawback I see is that it
becomes much more expensive for dynamically calculated signatures.

>>
>> That would only apply at a per-module level, though, so it would require
>> an
>> indirection for the signature IDs. But it would avoid a global registry.
>>
>> Another idea would be to set the signature ID field to 0 at the beginning
>> and call a C-API function to let the current runtime assign an ID>  0,
>> unique for the currently running applica

Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Robert Bradshaw
On Fri, Apr 13, 2012 at 12:52 PM, Stefan Behnel  wrote:
> Robert Bradshaw, 13.04.2012 20:21:
>> On Fri, Apr 13, 2012 at 10:26 AM, Robert Bradshaw wrote:
>>> On Fri, Apr 13, 2012 at 4:59 AM, Dag Sverre Seljebotn wrote:
 On 04/13/2012 01:38 PM, Stefan Behnel wrote:
> That would only apply at a per-module level, though, so it would
> require an indirection for the signature IDs. But it would avoid a
> global registry.
>
> Another idea would be to set the signature ID field to 0 at the beginning
> and call a C-API function to let the current runtime assign an ID>  0,
> unique for the currently running application. Then every user would only
> have to parse the signature once to adapt to the respective ID and could
> otherwise branch based on it directly.
>
> For Cython, we could generate a static ID variable for each typed call
> that
> we found in the sources. When encountering a C signature on a callable,
> either a) the ID variable is still empty (initial case), then we parse the
> signature to see if it matches the expected signature. If it does, we
> assign the corresponding ID to the static ID variable and issue a direct
> call. If b) the ID field is already set (normal case), we compare the
> signature IDs directly and issue a C call if they match. If the IDs do not
> match, we issue a normal Python call.
>>>
>>> If I understand correctly, you're proposing
>>>
>>> struct {
>>>  char* sig;
>>>  long id;
>>> } sig_t;
>>>
>>> Where comparison would (sometimes?) compute id from sig by augmenting
>>> a global counter and dict? Might be expensive to bootstrap, but
>>> eventually all relevant ids would be filled in and it would be quick.
>
> Yes. If a function is only called once, the overhead won't matter. And
> starting from the second call, it would either be fast if the function
> signature matches or slow anyway if it doesn't match.

There's still data locality issues, including the cached id for the
caller as well as the callee.

>>> Interesting. I wonder what the performance penalty would be over
>>> assuming id is statically computed lots of the time, and using that to
>>> compare against fixed values. And there's memory locality issues as
>>> well.
>>
>> To clarify, I'd really like to have the following as fast as possible:
>>
>> if (callable.sig.id == X) {
>>    // yep, that's what I thought
>> } else {
>>    // generic call
>> }
>>
>> Alternatively, one can imagine wanting to do:
>>
>> switch (callable.sig.id) {
>>     case X:
>>         // I can do this
>>     case Y:
>>         // this is common and fast as well
>>     ...
>>     default:
>>         // generic call
>> }
>
> Yes, that's the idea.
>
>
>> There is some question about how promotion should work (e.g. should
>> this flexibility reside in the caller or the callee (or both, though
>> that could result in a quadratic number of comparisons)?)
>
> Callees could expose multiple signatures (which would result in a direct
> call for each, without further comparisons), then the caller would have to
> choose between those. However, if none matches exactly, the caller might
> want to promote its arguments and try more signatures. In any case, it's
> the caller that does the work, never the callee.
>
> We could generate code like this:
>
>    /* cdef int x = ...
>     * cdef long y = ...
>     * cdef int z       # interesting: what if z is not typed?
>     * z = func(x, y)
>     */
>
>    if (func.sig.id == id("[int,long] -> int")) {
>         z = ((cast)func.cfunc) (x,y);
>    } else if (sizeof(long) > sizeof(int) &&
>               (func.sig.id == id("[long,long] -> int"))) {
>         z = ((cast)func.cfunc) ((long)x, y);
>    } etc. ... else {
>         /* pack and call as Python function */
>    }
>
> Meaning, the C compiler could reduce the amount of optimistic call code at
> compile time.

Interesting idea. Alternatively, I wonder if the signature could
reflect exactly-sized types rather than int/long/etc. Perhaps that
would make the code more complicated on both ends...

I'm assuming your id(...) is computed at compile time in this example,
right? Otherwise it would get a bit messier.

- Robert


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Robert Bradshaw
On Fri, Apr 13, 2012 at 1:27 PM, Dag Sverre Seljebotn
 wrote:
> Ah, I didn't think about 6-bit or huffman. Certainly helps.

Yeah, we don't want to complicate the ABI too much, but I think
something like 8 4-bit common chars and 32 6-bit other chars (or 128
8-bit other chars) wouldn't be outrageous. The fact that we only have
to encode into a single word makes the algorithm very simple (though
the majority of the time we'd spit out pre-encoded literals). We have
a version number to play with this as well.
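A minimal sketch of such an encoding follows. The alphabet below is invented for illustration; the CEP would have to standardize one, and a real table could hold up to 63 codes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Encode a signature string into a single 64-bit key using a 6-bit
 * alphabet, so up to 10 characters fit in one word.  Returns 0 when the
 * signature is too long or uses a character outside the table, meaning
 * "fall back to the char* form".  The alphabet here is made up. */
static uint64_t encode_sig(const char *sig)
{
    static const char alphabet[] = "id)lfscV-*&?";  /* subset of <= 63 codes */
    uint64_t key = 0;
    int shift = 0;
    for (; *sig; sig++, shift += 6) {
        const char *p = strchr(alphabet, *sig);
        if (p == NULL || shift > 54)
            return 0;              /* not representable in one word */
        key |= (uint64_t)(p - alphabet + 1) << shift;  /* code 0 = end */
    }
    return key;
}
```

A compiler could emit the resulting constants directly, so the comparison at a call site is one word against a literal.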

> I'm almost +1 on your proposal now, but a couple of more ideas:
>
> 1) Let the key (the size_t) spill over to the next specialization entry if
> it is too large; and prepend that key with a continuation code (two size-ts
> could together say "iii)-d\0\0" on 32 bit systems with 8bit encoding, using
> - as continuation). The key-based caller will expect a continuation if it
> knows about the specialization, and the prepended char will prevent spurious
> matches against the overspilled slot.
>
> We could even use the pointers for part of the continuation...
>
> 2) Separate the char* format strings from the keys, ie this memory layout:
>
> Version,nslots,nspecs,funcptr,key,funcptr,key,...,sigcharptr,sigcharptr...
>
> Where nslots is larger than nspecs if there are continuations.
>
> OK, this is getting close to my original proposal, but the difference is the
> continuation char, so that if you expect a short signature, you can safely
> scan every slot with no branching and no null-checking necessary.

I don't think we need nslots (though it might be interesting). My
thought is that once you start futzing with variable-length keys, you
might as well just compare char*s.

If one is concerned about memory, one could force the sigcharptr to be
aligned, and then the "keys" could be either sigcharptr or key
depending on whether the least significant bit was set. One could
easily scan for/switch on a key and scanning for a char* would be
almost as easy (just don't dereference if the lsb is set).

I don't see us being memory constrained, so

(version,nspecs,futureuse),(key,sigcharptr,funcptr)*,optionalsigchardata*

seems fine to me even if only one of key/sigchrptr is ever used per
spec. Null-terminating the specs would work fine as well (one less
thing to keep track of during iteration).
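The tagged-slot idea above might be sketched as follows; the names are hypothetical, and inline keys are stored shifted left one bit so the tag costs one bit of key space.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the tagged-slot idea: a slot is either an inline key or a
 * pointer to the full signature string, told apart by the least
 * significant bit.  Inline keys are stored shifted left with the LSB
 * set; char* values are at least 2-byte aligned, so their LSB is 0
 * and they may be dereferenced safely. */
typedef union {
    uintptr_t key;         /* LSB == 1: inline encoded signature */
    const char *sig;       /* LSB == 0: pointer to signature string */
} slot_t;

static int slot_is_key(slot_t s)
{
    return (s.key & 1) != 0;
}

static slot_t make_key_slot(uintptr_t encoded)
{
    slot_t s;
    s.key = (encoded << 1) | 1;    /* shift to make room for the tag */
    return s;
}
```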

- Robert


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Nathaniel Smith
On Fri, Apr 13, 2012 at 9:27 PM, Dag Sverre Seljebotn
 wrote:
> Ah, I didn't think about 6-bit or huffman. Certainly helps.
>
> I'm almost +1 on your proposal now, but a couple of more ideas:
>
> 1) Let the key (the size_t) spill over to the next specialization entry if
> it is too large; and prepend that key with a continuation code (two size-ts
> could together say "iii)-d\0\0" on 32 bit systems with 8bit encoding, using
> - as continuation). The key-based caller will expect a continuation if it
> knows about the specialization, and the prepended char will prevent spurious
> matches against the overspilled slot.
>
> We could even use the pointers for part of the continuation...

I am really lost here. Why is any of this complicated encoding stuff
better than interning? Interning takes one line of code, is incredibly
cheap (one dict lookup per call site and function definition), and it
lets you check any possible signature (even complicated ones involving
memoryviews) by doing a single-word comparison. And best of all, you
don't have to think hard to make sure you got the encoding right. ;-)

On a 32-bit system, pointers are smaller than a size_t, but more
expressive! You can still do binary search if you want, etc. Is the
problem just that interning requires a runtime calculation? Because I
feel like C users (like numpy) will want to compute these compressed
codes at module-init anyway, and those of us with a fancy compiler
capable of computing them ahead of time (like Cython) can instruct
that fancy compiler to compute them at module-init time just as
easily?
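For illustration, the pointer-comparison payoff of interning might look like this in C. This standalone table is just a sketch; as discussed, a real store would have to be shared across modules and thread-safe.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy interning store: the first module to intern a signature string
 * establishes the canonical copy; every later intern of an equal string
 * returns the same pointer, so call sites compare a single word. */
#define TABLE_SIZE 256
static const char *interned[TABLE_SIZE];
static int n_interned = 0;

static const char *intern_sig(const char *sig)
{
    int i;
    char *copy;
    for (i = 0; i < n_interned; i++)
        if (strcmp(interned[i], sig) == 0)
            return interned[i];        /* canonical copy already exists */
    copy = malloc(strlen(sig) + 1);    /* no bounds/NULL check: sketch only */
    strcpy(copy, sig);
    interned[n_interned] = copy;
    return interned[n_interned++];
}
```

Interning happens once per module init; every subsequent signature check is one pointer comparison, regardless of how long the signature string is.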

-- Nathaniel


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Robert Bradshaw
On Fri, Apr 13, 2012 at 2:24 PM, Nathaniel Smith  wrote:
> On Fri, Apr 13, 2012 at 9:27 PM, Dag Sverre Seljebotn
>  wrote:
>> Ah, I didn't think about 6-bit or huffman. Certainly helps.
>>
>> I'm almost +1 on your proposal now, but a couple of more ideas:
>>
>> 1) Let the key (the size_t) spill over to the next specialization entry if
>> it is too large; and prepend that key with a continuation code (two size-ts
>> could together say "iii)-d\0\0" on 32 bit systems with 8bit encoding, using
>> - as continuation). The key-based caller will expect a continuation if it
>> knows about the specialization, and the prepended char will prevent spurious
>> matches against the overspilled slot.
>>
>> We could even use the pointers for part of the continuation...
>
> I am really lost here. Why is any of this complicated encoding stuff
> better than interning? Interning takes one line of code, is incredibly
> cheap (one dict lookup per call site and function definition), and it
> lets you check any possible signature (even complicated ones involving
> memoryviews) by doing a single-word comparison. And best of all, you
> don't have to think hard to make sure you got the encoding right. ;-)
>
> On a 32-bit system, pointers are smaller than a size_t, but more
> expressive! You can still do binary search if you want, etc. Is the
> problem just that interning requires a runtime calculation? Because I
> feel like C users (like numpy) will want to compute these compressed
> codes at module-init anyway, and those of us with a fancy compiler
> capable of computing them ahead of time (like Cython) can instruct
> that fancy compiler to compute them at module-init time just as
> easily?

Good question.

The primary disadvantage of interning that I see is memory locality. I
suppose if all the C-level caches of interned values were co-located,
this may not be as big of an issue. Not being able to compare against
compile-time constants may thwart some optimization opportunities, but
that's less clear.

It also requires coordination on a common repository, but I suppose one
would just stick a set in some standard module (or leverage Python's
interning).

- Robert


Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Dag Sverre Seljebotn


Robert Bradshaw  wrote:

>On Fri, Apr 13, 2012 at 1:27 PM, Dag Sverre Seljebotn
> wrote:
>> Ah, I didn't think about 6-bit or huffman. Certainly helps.
>
>Yeah, we don't want to complicate the ABI too much, but I think
>something like 8 4-bit common chars and 32 6-bit other chars (or 128
>8-bit other chars) wouldn't be outrageous. The fact that we only have
>to encode into a single word makes the algorithm very simple (though
>the majority of the time we'd spit out pre-encoded literals). We have
>a version number to play with this as well.
>
>> I'm almost +1 on your proposal now, but a couple of more ideas:
>>
>> 1) Let the key (the size_t) spill over to the next specialization
>entry if
>> it is too large; and prepend that key with a continuation code (two
>size-ts
>> could together say "iii)-d\0\0" on 32 bit systems with 8bit encoding,
>using
>> - as continuation). The key-based caller will expect a continuation
>if it
>> knows about the specialization, and the prepended char will prevent
>spurious
>> matches against the overspilled slot.
>>
>> We could even use the pointers for part of the continuation...
>>
>> 2) Separate the char* format strings from the keys, ie this memory
>layout:
>>
>>
>Version,nslots,nspecs,funcptr,key,funcptr,key,...,sigcharptr,sigcharptr...
>>
>> Where nslots is larger than nspecs if there are continuations.
>>
>> OK, this is getting close to my original proposal, but the difference
>is the
> continuation char, so that if you expect a short signature, you can
>safely
> scan every slot with no branching and no null-checking necessary.
>
>I don't think we need nslots (though it might be interesting). My
>thought is that once you start futzing with variable-length keys, you
>might as well just compare char*s.

This is where we disagree. If you are the caller you know at compile-time how 
much you want to match; I think comparing 2 or 3 size_t values with no looping is a 
lot better (a fully-unrolled, 64-bit per instruction strcmp with one of the 
operands known to the compiler...).


>
>If one is concerned about memory, one could force the sigcharptr to be
>aligned, and then the "keys" could be either sigcharptr or key
>depending on whether the least significant bit was set. One could
>easily scan for/switch on a key and scanning for a char* would be
>almost as easy (just don't dereference if the lsb is set).
>
>I don't see us being memory constrained, so
>
>(version,nspecs,futureuse),(key,sigcharptr,funcptr)*,optionalsigchardata*
>
>seems fine to me even if only one of key/sigchrptr is ever used per
>spec. Null-terminating the specs would work fine as well (one less
>thing to keep track of during iteration).

Well, can't one always use more L1 cache, or is that not a concern? If you have 
5-6 different routines calling each other using this mechanism, each with 
multiple specializations, those unused slots translate to many cache lines 
wasted.

I don't think it is that important; I just think that how pretty the C struct 
declaration ends up looking should not be a concern at all, when the whole 
point of this is speed anyway. You can always just use a throwaway struct 
declaration and a cast to get whatever layout you need. If the 'padding' leads 
to less branching then fine, but I don't see that it helps in any way.

To refine my proposal a bit, we have a list of variable size entries,

(keydata, keydata, ..., funcptr)

where each keydata and the ptr is 64 bits on all platforms (see below); each 
entry must have a total length multiple of 128 bits (so that one can safely 
scan for a signature in 128 bit increments in the data *without* parsing or 
branching, you'll never hit a pointer), and each key but the first starts with 
a 'dash'.

Signature strings are either kept separate, or even parsed/decoded from the 
keys. We really only care about speed when you have compiled or JITed code for 
the case; decoding should be fine otherwise.
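A much-simplified scan over such a table might look like the sketch below. For brevity it uses single-word keys only, with a zero key terminating the list; the continuation entries for longer signatures described above are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*funcptr_t)(void);

/* Much-simplified specialization table: one 64-bit key plus one
 * function pointer per entry, zero key terminating the list. */
typedef struct {
    uint64_t key;
    funcptr_t func;
} spec_t;

/* Caller-side scan: compare the compile-time-known key against each
 * slot; NULL means "no native match, fall back to a Python call". */
static funcptr_t find_spec(const spec_t *table, uint64_t key)
{
    for (; table->key != 0; table++)
        if (table->key == key)
            return table->func;
    return NULL;
}
```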

BTW, won't the Cython-generated C code be a horrible mess if we use size_t 
rather than insisting on int64_t? (OK, those need some #ifdefs for various 
compilers, but that still seems cleaner than operating with 32-bit and 64-bit 
keys, and stdint.h is winning ground.)

Dag



>
>- Robert



Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Dag Sverre Seljebotn


Robert Bradshaw  wrote:

>On Fri, Apr 13, 2012 at 2:24 PM, Nathaniel Smith  wrote:
>> On Fri, Apr 13, 2012 at 9:27 PM, Dag Sverre Seljebotn
>>  wrote:
>>> Ah, I didn't think about 6-bit or huffman. Certainly helps.
>>>
>>> I'm almost +1 on your proposal now, but a couple of more ideas:
>>>
>>> 1) Let the key (the size_t) spill over to the next specialization
>entry if
>>> it is too large; and prepend that key with a continuation code (two
>size-ts
>>> could together say "iii)-d\0\0" on 32 bit systems with 8bit
>encoding, using
>>> - as continuation). The key-based caller will expect a continuation
>if it
>>> knows about the specialization, and the prepended char will prevent
>>spurious
>>> matches against the overspilled slot.
>>>
>>> We could even use the pointers for part of the continuation...
>>
>> I am really lost here. Why is any of this complicated encoding stuff
>> better than interning? Interning takes one line of code, is
>incredibly
>> cheap (one dict lookup per call site and function definition), and it
>> lets you check any possible signature (even complicated ones
>involving
>> memoryviews) by doing a single-word comparison. And best of all, you
>> don't have to think hard to make sure you got the encoding right. ;-)
>>
>> On a 32-bit system, pointers are smaller than a size_t, but more
>> expressive! You can still do binary search if you want, etc. Is the
>> problem just that interning requires a runtime calculation? Because I
>> feel like C users (like numpy) will want to compute these compressed
>> codes at module-init anyway, and those of us with a fancy compiler
>> capable of computing them ahead of time (like Cython) can instruct
>> that fancy compiler to compute them at module-init time just as
>> easily?
>
>Good question.
>
>The primary disadvantage of interning that I see is memory locality. I
>suppose if all the C-level caches of interned values were co-located,
>this may not be as big of an issue. Not being able to compare against
>compile-time constants may thwart some optimization opportunities, but
>that's less clear.
>
>It also requires coordination on a common repository, but I suppose one
>would just stick a set in some standard module (or leverage Python's
>interning).

More problems:

1) It doesn't work well with multiple interpreter states. Ok, nothing works 
with that at the moment, but it is on the roadmap for Python and we should not 
make it worse.

You basically *need* a thread-safe store separate from any Python interpreter; 
pythread.h, though, does not rely on the interpreter state, which helps.

2) you end up with the known comparison values in read-write memory segments 
rather than readonly segments, which is probably worse on multicore systems?

I really think that anything that we can do to make this near-c-speed should be 
done; none of the proposals are *that* complicated.

Using keys, NumPy can choose in its C code to be slower but more readable; 
interned strings, however, force Cython to be slower, with no way of choosing 
to go faster (to the degree that it has an effect; none of these claims were 
checked).

Dag


>
>- Robert



Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Dag Sverre Seljebotn


Dag Sverre Seljebotn  wrote:

>
>
>Robert Bradshaw  wrote:
>
>>On Fri, Apr 13, 2012 at 2:24 PM, Nathaniel Smith 
>wrote:
>>> On Fri, Apr 13, 2012 at 9:27 PM, Dag Sverre Seljebotn
>>>  wrote:
 Ah, I didn't think about 6-bit or huffman. Certainly helps.

 I'm almost +1 on your proposal now, but a couple of more ideas:

 1) Let the key (the size_t) spill over to the next specialization
>>entry if
 it is too large; and prepend that key with a continuation code (two
>>size-ts
 could together say "iii)-d\0\0" on 32 bit systems with 8bit
>>encoding, using
 - as continuation). The key-based caller will expect a continuation
>>if it
 knows about the specialization, and the prepended char will prevent
>>spurious
 matches against the overspilled slot.

 We could even use the pointers for part of the continuation...
>>>
>>> I am really lost here. Why is any of this complicated encoding stuff
>>> better than interning? Interning takes one line of code, is
>>incredibly
>>> cheap (one dict lookup per call site and function definition), and
>it
>>> lets you check any possible signature (even complicated ones
>>involving
>>> memoryviews) by doing a single-word comparison. And best of all, you
>>> don't have to think hard to make sure you got the encoding right.
>;-)
>>>
>>> On a 32-bit system, pointers are smaller than a size_t, but more
>>> expressive! You can still do binary search if you want, etc. Is the
>>> problem just that interning requires a runtime calculation? Because
>I
>>> feel like C users (like numpy) will want to compute these compressed
>>> codes at module-init anyway, and those of us with a fancy compiler
>>> capable of computing them ahead of time (like Cython) can instruct
>>> that fancy compiler to compute them at module-init time just as
>>> easily?
>>
>>Good question.
>>
>>The primary disadvantage of interning that I see is memory locality. I
>>suppose if all the C-level caches of interned values were co-located,
>>this may not be as big of an issue. Not being able to compare against
>>compile-time constants may thwart some optimization opportunities, but
>>that's less clear.
>>
>>It also requires coordination via a common repository, but I suppose one
>>would just stick a set in some standard module (or leverage Python's
>>interning).
>
>More problems:
>
>1) It doesn't work well with multiple interpreter states. Ok, nothing
>works with that at the moment, but it is on the roadmap for Python and
>we should not make it worse.
>
>You basically *need* a thread-safe store separate from any Python
>interpreter; pythread.h, though, does not rely on the interpreter state,
>which helps.

No, it doesn't, unless we want to ship a single(!) .so-file that can be 
depended upon by all relevant projects. There's just no way for loaded modules 
to communicate and synchronize that they know about this CEP except through an 
interpreter...

That's almost impossible to work around in any clean way (I can think of
several very ugly ones...). Unless the multiple-interpreter-state idea is
entirely dead in CPython, interning must be done separately for each
interpreter and the values stored in the module object. Ugh.


Dag

>
>2) you end up with the known comparison values in read-write memory
>segments rather than readonly segments, which is probably worse on
>multicore systems?
>
>I really think that anything that we can do to make this near-c-speed
>should be done; none of the proposals are *that* complicated.
>
>Using keys, NumPy can choose in its C code to be slower but more
>readable; but using interned strings forces Cython to be slower, and
>gives Cython no way of choosing to go faster. (To the degree that any
>of this has an effect; none of these claims have been checked.)
>
>Dag
>
>
>>
>>- Robert
>>___
>>cython-devel mailing list
>>cython-devel@python.org
>>http://mail.python.org/mailman/listinfo/cython-devel
>



Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Robert Bradshaw
On Fri, Apr 13, 2012 at 3:06 PM, Dag Sverre Seljebotn
 wrote:
>
>
> Robert Bradshaw  wrote:
>
>>On Fri, Apr 13, 2012 at 1:27 PM, Dag Sverre Seljebotn
>> wrote:
>>> Ah, I didn't think about 6-bit or Huffman. Certainly helps.
>>
>>Yeah, we don't want to complicate the ABI too much, but I think
>>something like 8 4-bit common chars and 32 6-bit other chars (or 128
>>8-bit other chars) wouldn't be outrageous. The fact that we only have
>>to encode into a single word makes the algorithm very simple (though
>>the majority of the time we'd spit out pre-encoded literals). We have
>>a version number to play with this as well.
>>
>>> I'm almost +1 on your proposal now, but a couple more ideas:
>>>
>>> 1) Let the key (the size_t) spill over to the next specialization
>>entry if
>>> it is too large; and prepend that key with a continuation code (two
>>size-ts
>>> could together say "iii)-d\0\0" on 32 bit systems with 8bit encoding,
>>using
>>> - as continuation). The key-based caller will expect a continuation
>>if it
>>> knows about the specialization, and the prepended char will prevent
>>spurious
>>> matches against the overspilled slot.
>>>
>>> We could even use the pointers for part of the continuation...
>>>
>>> 2) Separate the char* format strings from the keys, ie this memory
>>layout:
>>>
>>>
>>Version,nslots,nspecs,funcptr,key,funcptr,key,...,sigcharptr,sigcharptr...
>>>
>>> Where nslots is larger than nspecs if there are continuations.
>>>
>>> OK, this is getting close to my original proposal, but the difference
>>is the
>>> continuation char, so that if you expect a short signature, you can
>>safely
>>> scan every slot with no branching and no null-checking necessary.
>>
>>I don't think we need nslots (though it might be interesting). My
>>thought is that once you start futzing with variable-length keys, you
>>might as well just compare char*s.
>
> This is where we disagree. If you are the caller, you know at compile time how
> much you want to match; I think comparing 2 or 3 size_t values with no looping
> is a lot better (a fully-unrolled, 64-bit-per-instruction strcmp with one of
> the operands known to the compiler...).

Doesn't the compiler unroll strcmp much like this for a known operand?

>>If one is concerned about memory, one could force the sigcharptr to be
>>aligned, and then the "keys" could be either sigcharptr or key
>>depending on whether the least significant bit was set. One could
>>easily scan for/switch on a key and scanning for a char* would be
>>almost as easy (just don't dereference if the lsb is set).
>>
>>I don't see us being memory constrained, so
>>
>>(version,nspecs,futureuse),(key,sigcharptr,funcptr)*,optionalsigchardata*
>>
>>seems fine to me even if only one of key/sigchrptr is ever used per
>>spec. Null-terminating the specs would work fine as well (one less
>>thing to keep track of during iteration).
>
> Well, can't one always use more L1 cache, or is that not a concern? If you 
> have 5-6 different routines calling each other using this mechanism, each 
> with multiple specializations, those unused slots translate to many cache 
> lines wasted.
>
> I don't think it is that important, I just think that how pretty the C struct 
> declaration ends up looking should not be a concern at all, when the whole 
> point of this is speed anyway. You can always just use a throwaway struct 
> declaration and a cast to get whatever layout you need. If the 'padding' 
> leads to less branching then fine, but I don't see that it helps in any way.

I was more concerned about guaranteeing each char* was aligned.

> To refine my proposal a bit, we have a list of variable size entries,
>
> (keydata, keydata, ..., funcptr)
>
> where each keydata and the ptr is 64 bits on all platforms (see below); each 
> entry must have a total length multiple of 128 bits (so that one can safely 
> scan for a signature in 128 bit increments in the data *without* parsing or 
> branching, you'll never hit a pointer), and each key but the first starts 
> with a 'dash'.

Ah, OK, similar to UTF-8. Yes, I like this idea.

> Signature strings are either kept separate, or even parsed/decoded from the 
> keys. We really only care about speed when you have compiled or JITed code 
> for the case, decoding should be fine otherwise.

True.

> BTW, won't the Cython-generated C code be a horrible mess if we use size_t
> rather than insisting on int64_t? (OK, those need some ifdefs for various
> compilers, but that still seems cleaner than operating with both 32-bit and
> 64-bit keys, and stdint.h is winning ground.)

Sure, we could require 64-bit keys (and pointer slots).



On Fri, Apr 13, 2012 at 3:22 PM, Dag Sverre Seljebotn
 wrote:
>>> I am really lost here. Why is any of this complicated encoding stuff
>>> better than interning? Interning takes one line of code, is
>>incredibly
>>> cheap (one dict lookup per call site and function definition), and it
>>> lets you check any possible signature (even complicated ones
>>involving
>>> memoryviews) by doing a single-word comparison.

Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Nathaniel Smith
On Fri, Apr 13, 2012 at 11:22 PM, Dag Sverre Seljebotn
 wrote:
>
>
> Robert Bradshaw  wrote:
>
>>On Fri, Apr 13, 2012 at 2:24 PM, Nathaniel Smith  wrote:
>>> On Fri, Apr 13, 2012 at 9:27 PM, Dag Sverre Seljebotn
>>>  wrote:
 Ah, I didn't think about 6-bit or Huffman. Certainly helps.

 I'm almost +1 on your proposal now, but a couple more ideas:

 1) Let the key (the size_t) spill over to the next specialization
>>entry if
 it is too large; and prepend that key with a continuation code (two
>>size-ts
 could together say "iii)-d\0\0" on 32 bit systems with 8bit
>>encoding, using
 - as continuation). The key-based caller will expect a continuation
>>if it
 knows about the specialization, and the prepended char will prevent
>>spurious
 matches against the overspilled slot.

 We could even use the pointers for part of the continuation...
>>>
>>> I am really lost here. Why is any of this complicated encoding stuff
>>> better than interning? Interning takes one line of code, is
>>incredibly
>>> cheap (one dict lookup per call site and function definition), and it
>>> lets you check any possible signature (even complicated ones
>>involving
>>> memoryviews) by doing a single-word comparison. And best of all, you
>>> don't have to think hard to make sure you got the encoding right. ;-)
>>>
>>> On a 32-bit system, pointers are smaller than a size_t, but more
>>> expressive! You can still do binary search if you want, etc. Is the
>>> problem just that interning requires a runtime calculation? Because I
>>> feel like C users (like numpy) will want to compute these compressed
>>> codes at module-init anyway, and those of us with a fancy compiler
>>> capable of computing them ahead of time (like Cython) can instruct
>>> that fancy compiler to compute them at module-init time just as
>>> easily?
>>
>>Good question.
>>
>>The primary disadvantage of interning that I see is memory locality. I
>>suppose if all the C-level caches of interned values were co-located,
>>this may not be as big of an issue. Not being able to compare against
>>compile-time constants may thwart some optimization opportunities, but
>>that's less clear.

I would like to see some demonstration of this. E.g., you can run this:

echo -e '#include <string.h>\nint main(int argc, char ** argv) {
return strcmp(argv[0], "a"); }' | gcc -S -x c - -o - -O2 | less

Looks to me like for a short, known-at-compile-time string, with
optimization on, gcc implements it by basically sticking the string in
a global variable and then using a pointer... (If I do argv[0] ==
(char *)0x1234, then it places the constant value directly into the
instruction stream. Strangely enough, it does *not* inline the
constant value even if I do memcmp(&argv[0], "\1\2\3\4", 4), which
should be exactly equivalent...!)

I think gcc is just as likely to stick a bunch of
  static void * interned_dd_to_d;
  static void * interned_ll_to_l;
next to each other in the memory image as it is to stick a bunch of
equivalent manifest constants. If you're worried, make it static void
* interned_signatures[NUM_SIGNATURES] -- then they'll definitely be
next to each other.

>>It also requires coordination via a common repository, but I suppose one
>>would just stick a set in some standard module (or leverage Python's
>>interning).
>
> More problems:
>
> 1) It doesn't work well with multiple interpreter states. Ok, nothing works 
> with that at the moment, but it is on the roadmap for Python and we should 
> not make it worse.

This isn't a criticism, but I'd like to see a reference to the work in
this direction! My impression was that it's been on the roadmap for
maybe a decade, in a really desultory fashion:
  
http://docs.python.org/faq/library.html#can-t-we-get-rid-of-the-global-interpreter-lock
So if it's actually happening that's quite interesting.

> You basically *need* a thread-safe store separate from any Python
> interpreter; pythread.h, though, does not rely on the interpreter state,
> which helps.

Anyway, yes, if you can't rely on the interpreter than you'd need some
place to store the intern table, but I'm not sure why this would be a
problem (in Python 3.6 or whenever it becomes relevant).

> 2) you end up with the known comparison values in read-write memory segments 
> rather than readonly segments, which is probably worse on multicore systems?

Is it? Can you elaborate? Cache ping-ponging is certainly bad, but
that's when multiple cores are writing to the same cache line, I can't
see how the TLB flags would matter.

I guess the problem would be if you also have some other data in the
global variable space that you write to constantly, and then it turned
out they were placed next to these read-only comparison values in the
same cache line?

> I really think that anything that we can do to make this near-c-speed should 
> be done; none of the proposals are *that* complicated.

I agree, but I object to codifying the waving of dead chickens. :-)

> Using keys,

Re: [Cython] CEP1000: Native dispatch through callables

2012-04-13 Thread Greg Ewing

Dag Sverre Seljebotn wrote:


1) It doesn't work well with multiple interpreter states. Ok, nothing works
with that at the moment, but it is on the roadmap for Python


Is it really? I got the impression that it's not considered feasible,
since it would require massive changes to the entire implementation
and totally break the existing C API. Has someone thought of a way
around those problems?

--
Greg