Re: [Cython] CEP1000: Native dispatch through callables

2012-04-18 Thread Dag Sverre Seljebotn

On 04/17/2012 02:24 PM, Dag Sverre Seljebotn wrote:

On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:

Travis Oliphant recently raised the issue on the NumPy list of what
mechanisms to use to box native functions produced by his Numba so
that SciPy functions can call it, e.g. (I'm making the numba part
up):

@numba # Compiles function using LLVM def f(x): return 3 * x

print scipy.integrate.quad(f, 1, 2) # do many callbacks natively!

Obviously, we want something standard, so that Cython functions can
also be called in a fast way.


OK, here's the benchmark code I've written:

https://github.com/dagss/cep1000

Assumptions etc.:

- (Very) warm cache case is tested

- I compile and link libmycallable.so, libmycaller.so and ./bench; with
-fPIC, to emulate the Python environment

- I use mostly pure C but use PyTypeObject in order to get the offsets
to tp_flags etc right (I emulate the checking that would happen on a
PyObject* according to CEP1000).

- The test function is "double f(double x) { return x * x; }

- The benchmark is run in a loop J=100 times (and time divided by
J). This is repeated K=1 times and the minimum walltime of the K run
is used. This gave very stable readings on my system.

Fixing loop iterations:

In the initial results I just scanned the overload list until
NULL-termination. It seemed to me that the code generated for this
scanning was the most important factor.

Therefore I fixed the number of overloads as a known compile-time macro
N *in the caller*. This is somewhat optimistic; however I didn't want to
play with figuring out loop unrolling etc. at the same time, and
hardcoding the length of the overload list sort of took that part out of
the equation.


Table explanation:

- N: Number of overloads in list. For N=10, there's 9 non-matching
overloads in the list before the matching 10 (but caller doesn't know
this). For N=1, the caller knows this and optimize for a hit in the
first entry.

- MISMATCHES: If set, the caller tries 4 non-matching signatures before
hitting the final one. If not set, only the correct signature is tried.

- LIKELY: If set, a GCC likely() macro is used to expect that the
signature matches.


RESULTS:

Direct call (and execution of!) the function in benchmark loop took 4.8 ns.

An indirect dispatch through a function pointer of known type took 5.4 ns

Notation below is (intern key), in ns

N=1:
MISMATCHES=False:
LIKELY=True: 6.44 6.44
LIKELY=False: 7.52 8.06
MISMATCHES=True: 8.59 8.59
N=10:
MISMATCHES=False: 17.19 19.20
MISMATCHES=True: 36.52 37.59

To be clear, "intern" is an interned "char*" (comparison with a 64 bits
global variable), while key is comparison of a size_t (comparison of a
64-bit immediate in the instruction stream).


First: My benchmarks today are a little inconsistent with earlier 
results. I think I have converged now in terms of number of iterations 
(higher than last time), but that doesn't explain why indirect dispatch 
through function pointer is now *higher*:


Direct took 4.83 ns
Dispatch took 5.91 ns

Anyway, even if crude, hopefully this will tell us something. Order of 
benchmark numbers are:


intern key get_func_intern get_func_key

where the get_func_XX versions retrieve a function pointer taking either 
a single interned signature or a single key as argument (just see 
mycallable.c).


In the MISMATCHES case, the get_func_XX is called 4 times with a miss 
and then with the match.


N=1
 - MISMATCHES=False:
 --- LIKELY=True:5.91   6.44   8.59   9.13
 --- LIKELY=False:   7.52   7.52   9.13   9.13
 - MISMATCHES=True: 11.28  11.28  22.56  22.56

N=10
 - MISMATCHES=False:17.18  18.80   29.75  10.74(*)
 - MISMATCHES=True: 36.06  38.13  105.00  36.52

Benchmark comments:

The one marked (*) is implemented as a switch statement with known keys 
compile-time. I tried shifting around the case label values a bit but 
the result persists; it could just be that the compiler does a very good 
job of the switch as well.


Overall comments:

Picking a winner doesn't get easier. I'll try to (make myself) get some 
perspective. Thinking of extreme cases where performance matters, here's 
one (sqrt is apparently 0.5 ns on my machine; sin would be 40 ns):


from numpy import sqrt, sin

cdef double f(double x):
return sqrt(x * x) # or sin(x * x)

Of course, here one could get the pointer in the module at import time. 
However, here:


from numpy import sqrt

cdef double f(double x):
return np.sqrt(x * x) # or np.sin(x * x)

the __getattr__ on np sure is larger than any effect we discuss.

From the numbers above, I think I'm ready to accept the "getfuncptr" 
approach penalty (1.6 ns for a direct hit, larger when the caller 
accepts more signatures) as acceptable, given the added flexibility. 
When you care about the 1.6 ns, you're always going to want to do early 
binding anyway.


However, just as I'm convinced about interning, there appears to be two 
new arguments for keys:


 - For a large number of 

Re: [Cython] CEP1000: Native dispatch through callables

2012-04-18 Thread Dag Sverre Seljebotn

On 04/18/2012 11:35 PM, Dag Sverre Seljebotn wrote:

On 04/17/2012 02:24 PM, Dag Sverre Seljebotn wrote:

On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:

Travis Oliphant recently raised the issue on the NumPy list of what
mechanisms to use to box native functions produced by his Numba so
that SciPy functions can call it, e.g. (I'm making the numba part
up):

@numba # Compiles function using LLVM def f(x): return 3 * x

print scipy.integrate.quad(f, 1, 2) # do many callbacks natively!

Obviously, we want something standard, so that Cython functions can
also be called in a fast way.


OK, here's the benchmark code I've written:

https://github.com/dagss/cep1000

Assumptions etc.:

- (Very) warm cache case is tested

- I compile and link libmycallable.so, libmycaller.so and ./bench; with
-fPIC, to emulate the Python environment

- I use mostly pure C but use PyTypeObject in order to get the offsets
to tp_flags etc right (I emulate the checking that would happen on a
PyObject* according to CEP1000).

- The test function is "double f(double x) { return x * x; }

- The benchmark is run in a loop J=100 times (and time divided by
J). This is repeated K=1 times and the minimum walltime of the K run
is used. This gave very stable readings on my system.

Fixing loop iterations:

In the initial results I just scanned the overload list until
NULL-termination. It seemed to me that the code generated for this
scanning was the most important factor.

Therefore I fixed the number of overloads as a known compile-time macro
N *in the caller*. This is somewhat optimistic; however I didn't want to
play with figuring out loop unrolling etc. at the same time, and
hardcoding the length of the overload list sort of took that part out of
the equation.


Table explanation:

- N: Number of overloads in list. For N=10, there's 9 non-matching
overloads in the list before the matching 10 (but caller doesn't know
this). For N=1, the caller knows this and optimize for a hit in the
first entry.

- MISMATCHES: If set, the caller tries 4 non-matching signatures before
hitting the final one. If not set, only the correct signature is tried.

- LIKELY: If set, a GCC likely() macro is used to expect that the
signature matches.


RESULTS:

Direct call (and execution of!) the function in benchmark loop took
4.8 ns.

An indirect dispatch through a function pointer of known type took 5.4 ns

Notation below is (intern key), in ns

N=1:
MISMATCHES=False:
LIKELY=True: 6.44 6.44
LIKELY=False: 7.52 8.06
MISMATCHES=True: 8.59 8.59
N=10:
MISMATCHES=False: 17.19 19.20
MISMATCHES=True: 36.52 37.59

To be clear, "intern" is an interned "char*" (comparison with a 64 bits
global variable), while key is comparison of a size_t (comparison of a
64-bit immediate in the instruction stream).


First: My benchmarks today are a little inconsistent with earlier
results. I think I have converged now in terms of number of iterations
(higher than last time), but that doesn't explain why indirect dispatch
through function pointer is now *higher*:

Direct took 4.83 ns
Dispatch took 5.91 ns

Anyway, even if crude, hopefully this will tell us something. Order of
benchmark numbers are:

intern key get_func_intern get_func_key

where the get_func_XX versions retrieve a function pointer taking either
a single interned signature or a single key as argument (just see
mycallable.c).

In the MISMATCHES case, the get_func_XX is called 4 times with a miss
and then with the match.

N=1
- MISMATCHES=False:
--- LIKELY=True: 5.91 6.44 8.59 9.13
--- LIKELY=False: 7.52 7.52 9.13 9.13
- MISMATCHES=True: 11.28 11.28 22.56 22.56

N=10
- MISMATCHES=False: 17.18 18.80 29.75 10.74(*)
- MISMATCHES=True: 36.06 38.13 105.00 36.52

Benchmark comments:

The one marked (*) is implemented as a switch statement with known keys
compile-time. I tried shifting around the case label values a bit but
the result persists; it could just be that the compiler does a very good
job of the switch as well.


I should make this clearer: The issue is that the compiler may have 
reordered the labels so that the hit came close to first; in the intern 
case the code is written so that the hit is always after 9 mismatches.


So I redid the (*) test using 10 cases with very different numeric 
values, and then tried each 10 as the matching case. Timings were stable 
for each choice of label (so this is not noise), with values:


13.4 11.8 11.8 12.3 10.7 11.2 12.3

Guess this is the binary decision tree Mark talked about...

Dag



Overall comments:

Picking a winner doesn't get easier. I'll try to (make myself) get some
perspective. Thinking of extreme cases where performance matters, here's
one (sqrt is apparently 0.5 ns on my machine; sin would be 40 ns):

from numpy import sqrt, sin

cdef double f(double x):
return sqrt(x * x) # or sin(x * x)

Of course, here one could get the pointer in the module at import time.
However, here:

from numpy import sqrt

cdef double f(double x):
return np.sqrt(x * x) # or np.s

Re: [Cython] CEP1000: Native dispatch through callables

2012-04-18 Thread Stefan Behnel
Dag Sverre Seljebotn, 18.04.2012 23:35:
> from numpy import sqrt, sin
> 
> cdef double f(double x):
> return sqrt(x * x) # or sin(x * x)
> 
> Of course, here one could get the pointer in the module at import time.

That optimisation would actually be very worthwhile all by itself. I mean,
we know what signatures we need for globally imported functions throughout
the module, so we can reduce the call to a single jump through a function
pointer (although likely with a preceding NULL check, which the branch
prediction would be happy to give us for free). At least as long as sqrt is
not being reassigned, but that should hit the 99% case.


> However, here:
> 
> from numpy import sqrt
> 
> cdef double f(double x):
> return np.sqrt(x * x) # or np.sin(x * x)
> 
> the __getattr__ on np sure is larger than any effect we discuss.

Yes, that would have to stay a .pxd case, I guess.


> From the numbers above, I think I'm ready to accept the "getfuncptr"
> approach penalty (1.6 ns for a direct hit, larger when the caller accepts
> more signatures) as acceptable, given the added flexibility. When you care
> about the 1.6 ns, you're always going to want to do early binding anyway.
> 
> However, just as I'm convinced about interning, there appears to be two new
> arguments for keys:
> 
>  - For a large number of overloads with getfuncptr, it can be much faster
> than interning. A 20ns difference starts to get interesting.

I don't think any of the numbers you presented marks any of the solutions
as "expensive" or "wrong". The advantage of a callback function for this is
that it is the most flexible solution that will most easily hit all use cases.

The only problem I see with getfuncptr() is that it shifts not only the
runtime work to the callee but also the development work, debugging,
optimisation, etc. We should provide a default implementation for non-JITs
in that case, preferably one that fits into a header file rather than
requiring a library. It could still become a set of C-API functions when
(if?) CPython starts to adopt this (and exposes it also for its builtins).


>  - PSF GSoC proposals are not public yet, but I think I can say as much as
> that there's a PEP 3121 (multiple interpreter states) proposal, and that
> Martin von Lowis is favourable about it. If that goes anywhere it doesn't
> make interning impossible but it requires a shared C component and changing
> the spec from PyBytesObject to char*. Perhaps that can be done in a
> PEP-ification revision though.

I asked him what he thinks about the status of that PEP and he seems to be
unhappy about the current massive lack of evaluation data regarding the
general applicability and completeness of the infrastructure. One of the
outcomes of the GSoC would be that we learn what problems actually exist
and what needs to be done (if anything) to make this work for more code out
there. IMHO, that would be a very valuable result, also for us.

Note that the focus for the GSoC project is on the stdlib C modules.
Without those, general support in Cython wouldn't be very helpful for any
real-world code.

We should move any PEP3121 related discussion to a separate (mammoth?)
thread and a new CEP, though (the tracker tickets are already there). This
is a large topic that is only loosely related to your CEP. Note that module
global C variables would no longer exist with PEP3121 either. They would
move into a module struct (basically a module closure). So we'd pay with an
indirection already, for everything.

Stefan
___
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel