On 04/19/2012 12:56 PM, Dag Sverre Seljebotn wrote:
On 04/19/2012 11:07 AM, Nathaniel Smith wrote:
On Wed, Apr 18, 2012 at 10:58 PM, Dag Sverre Seljebotn
<d.s.seljeb...@astro.uio.no> wrote:
On 04/18/2012 11:35 PM, Dag Sverre Seljebotn wrote:
On 04/17/2012 02:24 PM, Dag Sverre Seljebotn wrote:
On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
Travis Oliphant recently raised the issue on the NumPy list of what
mechanisms to use to box native functions produced by his Numba so
that SciPy functions can call it, e.g. (I'm making the numba part
up):
@numba  # Compiles function using LLVM
def f(x):
    return 3 * x

print scipy.integrate.quad(f, 1, 2)  # do many callbacks natively!
Obviously, we want something standard, so that Cython functions can
also be called in a fast way.
OK, here's the benchmark code I've written:
https://github.com/dagss/cep1000
Assumptions etc.:
- (Very) warm cache case is tested
- I compile and link libmycallable.so, libmycaller.so and ./bench with
-fPIC, to emulate the Python environment.
- I use mostly pure C, but use PyTypeObject in order to get the offsets
to tp_flags etc. right (I emulate the checking that would happen on a
PyObject* according to CEP1000).
- The test function is "double f(double x) { return x * x; }".
- The benchmark is run in a loop J=1000000 times (and the time divided
by J). This is repeated K=10000 times and the minimum walltime of the K
runs is used. This gave very stable readings on my system. (A timing
sketch follows this list.)
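A minimal sketch of that timing scheme (the volatile sink and the use of
clock_gettime() are my own choices for illustration; the actual harness
is in the repository):

#include <stdio.h>
#include <time.h>

#define J 1000000  /* calls per timed run */
#define K 10000    /* repetitions; the minimum walltime is kept */

/* The test function from the benchmark. */
static double f(double x) { return x * x; }

/* Volatile sink so the calls are not optimized away. */
static volatile double sink;

int main(void) {
    double best_ns = 1e30;
    for (int k = 0; k < K; k++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int j = 0; j < J; j++)
            sink = f(1.5);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                     (t1.tv_nsec - t0.tv_nsec)) / J;
        if (ns < best_ns)
            best_ns = ns;
    }
    printf("%.2f ns per call\n", best_ns);
    return 0;
}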
Fixing loop iterations:
In the initial results I just scanned the overload list until
NULL-termination. It seemed to me that the code generated for this
scanning was the most important factor.
Therefore I fixed the number of overloads as a known compile-time macro
N *in the caller*. This is somewhat optimistic; however, I didn't want
to play with figuring out loop unrolling etc. at the same time, and
hardcoding the length of the overload list sort of took that part out
of the equation.
Table explanation:
- N: Number of overloads in the list. For N=10, there are 9 non-matching
overloads in the list before the matching 10th (but the caller doesn't
know this). For N=1, the caller knows this and optimizes for a hit on
the first entry.
- MISMATCHES: If set, the caller tries 4 non-matching signatures before
hitting the final one. If not set, only the correct signature is tried.
- LIKELY: If set, a GCC likely() macro is used to expect that the
signature matches (see the sketch just below).
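For concreteness, here is the shape of the scan being timed, with
invented struct and function names (the real definitions live in
mycallable.c); LIKELY wraps GCC's __builtin_expect:

#include <stddef.h>

#define LIKELY(x) __builtin_expect(!!(x), 1)

/* Hypothetical overload-table entry: an interned signature pointer
   plus the function pointer it describes. */
typedef struct {
    const char *signature;  /* interned: pointer equality suffices */
    void       *funcptr;
} overload_t;

/* N is a compile-time constant in the caller, so the compiler can
   unroll this loop completely. */
static void *find_overload(const overload_t table[], size_t n,
                           const char *wanted) {
    for (size_t i = 0; i < n; i++)
        if (LIKELY(table[i].signature == wanted))
            return table[i].funcptr;
    return NULL;  /* no match: fall back to the slow path */
}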
RESULTS:
A direct call to (and execution of!) the test function in the benchmark
loop took 4.8 ns. An indirect dispatch through a function pointer of
known type took 5.4 ns.
Notation below is (intern key), in ns
N=1:
MISMATCHES=False:
LIKELY=True: 6.44 6.44
LIKELY=False: 7.52 8.06
MISMATCHES=True: 8.59 8.59
N=10:
MISMATCHES=False: 17.19 19.20
MISMATCHES=True: 36.52 37.59
To be clear, "intern" is an interned "char*" (comparison with a 64-bit
global variable), while "key" is a size_t (comparison with a 64-bit
immediate in the instruction stream).
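To illustrate with a minimal sketch (the names and the key value are
invented, not the actual CEP1000 encodings):

#include <stddef.h>

/* Intern: the canonical pointer lives in a global, so the check
   compares against a 64-bit value loaded from memory. */
extern const char *interned_sig;  /* the interned signature string */

static int match_intern(const char *sig) {
    return sig == interned_sig;
}

/* Key: the expected value is a compile-time constant, so the check
   compares against a 64-bit immediate in the instruction stream. */
#define EXPECTED_KEY ((size_t)0x64642964UL)  /* made-up encoding */

static int match_key(size_t key) {
    return key == EXPECTED_KEY;
}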
First: My benchmarks today are a little inconsistent with earlier
results. I think I have converged now in terms of the number of
iterations (higher than last time), but that doesn't explain why the
indirect dispatch through a function pointer is now *higher*:
Direct took 4.83 ns
Dispatch took 5.91 ns
Anyway, even if crude, hopefully this will tell us something. The order
of the benchmark numbers is:

intern key get_func_intern get_func_key

where the get_func_XX versions retrieve a function pointer, taking
either a single interned signature or a single key as argument (just
see mycallable.c). In the MISMATCHES case, get_func_XX is called 4
times with a miss and then with the match, as in the sketch below.
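Roughly like this, that is (the get_func_key prototype and the key
values are guesses from the description above; the real code is in
mycaller.c and mycallable.c):

#include <stddef.h>

typedef double (*dd_func)(double);

/* Hypothetical prototype: returns NULL if the key is unsupported. */
extern void *get_func_key(void *callable, size_t key);

static dd_func lookup_with_misses(void *callable) {
    static const size_t wrong_keys[4] = {11, 22, 33, 44};  /* made up */
    for (int i = 0; i < 4; i++)
        (void)get_func_key(callable, wrong_keys[i]);  /* 4 misses */
    return (dd_func)get_func_key(callable, 55);  /* the matching key */
}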
N=1
- MISMATCHES=False:
--- LIKELY=True: 5.91 6.44 8.59 9.13
--- LIKELY=False: 7.52 7.52 9.13 9.13
- MISMATCHES=True: 11.28 11.28 22.56 22.56
N=10
- MISMATCHES=False: 17.18 18.80 29.75 10.74(*)
- MISMATCHES=True: 36.06 38.13 105.00 36.52
Benchmark comments:
The one marked (*) is implemented as a switch statement with keys known
at compile time. I tried shifting around the case label values a bit,
but the result persists; it could just be that the compiler does a very
good job with the switch as well.
I should make this clearer: the issue is that the compiler may have
reordered the labels so that the hit came close to first; in the intern
case the code is written so that the hit always comes after 9
mismatches.

So I redid the (*) test using 10 cases with very different numeric
values, and then tried each of the 10 as the matching case. Timings
were stable for each choice of label (so this is not noise), with
values:
13.4 11.8 11.8 12.3 10.7 11.2 12.3
Guess this is the binary decision tree Mark talked about...
Yes, if you look at the ASM (this is easier to keep track of if you
make the switch cases into round decimal numbers, like 1000, 2000,
3000...), then you can see that gcc is generating a fully unrolled
binary search, as basically a set of nested if/else's, like:
if (value < 5000) {
    if (value == 2000) {
        return &ptr_2000;
    } else if (value == 4000) {
        return &ptr_4000;
    }
} else {
    if (value == 6000) {
        return &ptr_6000;
    } else if (value == 8000) {
        return &ptr_8000;
    }
}
I suppose if we're ambitious we could do the same with the intern
table for Cython compile-time variants (we don't know the values ahead
of time, but we know how many there will be, so we'd generate the list
of intern values, sort it, and then replace the comparison values
above with table[middle], etc.).
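A hypothetical sketch of what such generated code could look like for
four overloads: table[] would be filled and sorted by pointer value at
module-init time, while the search structure itself is fixed when
Cython generates the C code.

static const char *table[4];  /* interned signatures, sorted at init */
static void *funcs[4];        /* funcs[i] goes with table[i] */

static void *dispatch(const char *wanted) {
    /* Unrolled binary search over the sorted intern pointers. */
    if (wanted < table[2]) {
        if (wanted == table[0]) return funcs[0];
        if (wanted == table[1]) return funcs[1];
    } else {
        if (wanted == table[2]) return funcs[2];
        if (wanted == table[3]) return funcs[3];
    }
    return NULL;
}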
Right. With everything being essentially equal, this isn't getting easier.
I thought of some drawbacks of getfuncptr:
- Important: It doesn't allow you to actually inspect the supported
signatures, which is needed (or at least convenient) if you want to use
an FFI library or do some JIT-ing. So an iteration mechanism is still
needed in addition, meaning the number of things for the object to
implement grows a bit large. Default implementations help -- OTOH there
really wasn't a major drawback with the table approach (a possible
shape is sketched after this list) as long as JITs can just replace it?
- Minor: I've read that on Intel's Sandy Bridge, the micro-ops are
cached after instruction decoding, and that the micro-op cache is so
precious (and decoding so expensive) that the recommendation is no loop
unrolling at all! So sticking the table in unrolled instructions may
not continue to be a good idea. (Of course, getfuncptr doesn't force
you to do that, you could keep traversing data; but without getfuncptr
you are forced to take a table-driven approach, which may be better for
the micro-op cache. Pure speculation, though.)
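For reference, the table approach mentioned in the first point could be
as simple as the following sketch (the layout is invented here, loosely
following the NULL-terminated overload list idea from CEP1000):

/* NULL-terminated list of (signature, funcptr) entries; a JIT or FFI
   user can iterate it to discover the supported signatures, or
   overwrite entries with freshly compiled specializations. */
typedef struct {
    const char *signature;  /* interned; NULL terminates the table */
    void       *funcptr;
} sig_entry_t;

static void *find_in_table(const sig_entry_t *table, const char *wanted) {
    for (; table->signature != NULL; table++)
        if (table->signature == wanted)
            return table->funcptr;
    return NULL;
}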
Dag
_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel