On Sat, Apr 14, 2012 at 2:00 PM, mark florisson
<markflorisso...@gmail.com> wrote:
> On 14 April 2012 20:08, Dag Sverre Seljebotn
> <d.s.seljeb...@astro.uio.no> wrote:
>> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
>>>
>>> Travis Oliphant recently raised the issue on the NumPy list of what
>>> mechanisms to use to box native functions produced by his Numba so
>>> that SciPy functions can call it, e.g. (I'm making the numba part
>>> up):
>>
>> <snip>
>>
>> This thread is turning into one of those big ones...
>>
>> But I think it is really worth it in the end; I'm getting excited
>> about the possibility down the road of importing functions using
>> normal Python mechanisms and still having fast calls.
>>
>> Anyway, to organize discussion I've tried to revamp the CEP and
>> describe both the intern way and the strcmp way.
>>
>> The wiki appears to be down, so I'll post it below...
>>
>> Dag
>>
>> = CEP 1000: Convention for native dispatches through Python callables =
>>
>> Many callable objects are simply wrappers around native code. This
>> holds for any Cython function, f2py functions, manually written
>> CPython extensions, Numba, etc.
>>
>> Obviously, when native code calls other native code, it would be
>> nice to skip the significant cost of boxing and unboxing all the
>> arguments. Early binding at compile time is only possible between
>> different Cython modules, not between all the tools listed above.
>>
>> [[enhancements/nativecall|CEP 523]] deals with Cython-specific
>> aspects (and is out of date w.r.t. this CEP); this CEP is intended
>> to be about a cross-project convention only. If successful, this
>> CEP may be proposed as a PEP in a modified form.
>>
>> Motivating example (looking a year or two into the future):
>>
>> {{{
>> @numba
>> def f(x): return 2 * x
>>
>> @cython.inline
>> def g(x : cython.double): return 3 * x
>>
>> from fortranmod import h
>>
>> print f(3)
>> print g(3)
>> print h(3)
>> print scipy.integrate.quad(f, 0.2, 3) # fast callback!
>> print scipy.integrate.quad(g, 0.2, 3) # fast callback!
>> print scipy.integrate.quad(h, 0.2, 3) # fast callback!
>> }}}
>>
>> == The native-call slot ==
>>
>> We need ''fast'' access to probing whether a callable object
>> supports this CEP. Other mechanisms, such as an attribute in a
>> dict, are too slow for many purposes (quoting robertwb: "We're
>> trying to get a 300ns dispatch down to 10ns; you do not want a 50ns
>> dict lookup"). (Obviously, if you call a callable in a loop you can
>> fetch the pointer outside of the loop. But in particular if this
>> becomes a language feature in Cython it will be used in all sorts
>> of places.)
>>
>> So we hack another type slot into existing and future CPython
>> implementations in the following way: this CEP provides a C header
>> that for all Python versions defines a macro
>> {{{Py_TPFLAGS_UNOFFICIAL_EXTRAS}}} for a free bit in
>> {{{tp_flags}}} in the {{{PyTypeObject}}}.
>>
>> If that flag is present, then we extend {{{PyTypeObject}}} as
>> follows:
>> {{{
>> typedef struct {
>>     PyTypeObject tp_main;
>>     size_t tp_unofficial_flags;
>>     size_t tp_nativecall_offset;
>> } PyUnofficialTypeObject;
>> }}}
>>
>> {{{tp_unofficial_flags}}} is unused and should be all 0 for the
>> time being, but can be used later to indicate features beyond this
>> CEP.
>>
>> If {{{tp_nativecall_offset != 0}}}, this CEP is supported, and the
>> information for doing a native dispatch on a callable {{{obj}}} is
>> located at
>> {{{
>> (char*)obj + ((PyUnofficialTypeObject*)obj->ob_type)->tp_nativecall_offset;
>> }}}
>>
>> === GIL-less access ===
>>
>> It is OK to access the native-call table without holding the GIL.
>> This should of course only be used to call functions that state in
>> their signature that they don't need the GIL.
>>
>> This is important for JITted callables that would like to rewrite
>> their table as more specializations get added; if one needs to
>> reallocate the table, the old table must linger long enough that
>> all threads that are currently accessing it are done with it.
>>
>> == Native dispatch descriptor ==
>>
>> The final format for the descriptor is not agreed upon yet; this
>> sums up the major alternatives.
>>
>> The descriptor should be a list of specializations/overloads, each
>> described by a function pointer and a signature specification
>> string, such as "id)i" for {{{int f(int, double)}}}.
>>
>> The way it is stored must cater for two cases; first, when the
>> caller expects one or more hard-coded signatures:
>> {{{
>> if (obj has signature "id)i") {
>>     call;
>> } else if (obj has signature "if)i") {
>>     call with promoted second argument;
>> } else {
>>     box all arguments;
>>     PyObject_Call;
>> }
>> }}}
>
> There may be a lot of promotion/demotion combinations (you likely
> only want the former), especially for multiple arguments, so perhaps
> it makes sense to limit ourselves a bit. For instance, for numeric
> scalar argument types we could limit ourselves to long (and the
> unsigned counterparts), double and double complex.
>
> So char, short and int scalars will be promoted to long, float to
> double and float complex to double complex. Anything bigger, like
> long long etc., will be matched specifically. Promotions, and
> associated demotions in the callee if necessary, should be fairly
> cheap compared to checking all combinations or going through the
> Python layer.
True, though this could be a convention rather than a requirement of
the spec. Long vs. < long seems natural, but are there any systems
where (scalar) float still has an advantage over double? Of course,
pointers like float* vs. double* can't be promoted, so we would still
need this kind of type declaration.

>> The second case is when a call stack is built dynamically while
>> parsing the string. Since this has higher overhead anyway,
>> optimizing for the first case makes sense.
>>
>> === Approach 1: Interning/run-time allocated IDs ===
>>
>> 1A: Let each overload have a struct
>> {{{
>> struct {
>>     size_t signature_id;
>>     char *signature;
>>     void *func_ptr;
>> };
>> }}}
>> Within each process run, there is a 1:1 correspondence between
>> {{{signature}}} and {{{signature_id}}}. {{{signature_id}}} is
>> allocated by some central registry.
>>
>> 1B: Intern the string instead:
>> {{{
>> struct {
>>     char *signature; /* pointer must come from the central registry */
>>     void *func_ptr;
>> };
>> }}}
>> However, this is '''not''' trivial, since signature strings can be
>> allocated on the heap (e.g., a JIT would do this), so interned
>> strings must be memory-managed and reference counted. This could be
>> done by each object passing in the signature string '''both''' when
>> incref-ing and when decref-ing it in the interning machinery. Using
>> Python {{{bytes}}} objects is another option.
>>
>> ==== Discussion ====
>>
>> '''The cost of comparing a signature''': comparing a global
>> variable (needle) to a value that is guaranteed to already be in
>> cache (candidate match).
>>
>> '''Pros:'''
>>
>>  * Conceptually simple struct format.
>>
>> '''Cons:'''
>>
>>  * Requires a registry for interning strings. This must be
>>    "handshaked" between the implementors of this CEP (probably by
>>    "first to get at {{{sys.modules["_nativecall"]}}} sticks it
>>    there"), as we can't ship a common dependency library for this
>>    CEP.
>>
>> === Approach 2: Efficient strcmp of verbatim signatures ===
>>
>> The idea is to store the full signatures and the function pointers
>> together in the same memory area, but still have some structure to
>> allow for quick scanning through the list.
>>
>> Each entry has the structure {{{[signature_string, funcptr]}}}
>> where:
>>
>>  * The signature string has variable length, but the length is
>>    divisible by 8 bytes on all platforms. The {{{funcptr}}} is
>>    always 8 bytes (it is padded on 32-bit systems).
>>
>>  * The total size of the entry should be divisible by 16 bytes
>>    (i.e., the signature data should be 8 bytes, or 24 bytes,
>>    or...).
>>
>>  * All but the first chunk of signature data should start with a
>>    continuation character "-", i.e. a really long signature string
>>    could be {{{"iiiidddd-iiidddd-iiidddd-)d"}}}. That is, a "-" is
>>    inserted at every position in the string divisible by 8, except
>>    the first.
>>
>> The point is that if you know a signature, you can quickly scan
>> through the binary blob for the signature in 128-bit increments,
>> without worrying about the variable-size nature of each entry. The
>> rules above protect against spurious matches.

Note that these two approaches need not be mutually exclusive; a
cutoff could be established giving the best (and worst) of both.

>> ==== Optional: Encoding ====
>>
>> The strcmp approach can be made efficient for larger signatures by
>> using a more efficient encoding than ASCII. E.g., an encoding could
>> use 4 bits for the 12 most common symbols and 8 bits for 64 more
>> symbols (for a total of 76 symbols), of which some could be letter
>> combinations ("Zd", "T{"). This should be reasonably simple to
>> encode and decode.
>>
>> The CEP should provide C routines in a header file to work with the
>> signatures. Callers that wish to parse the format string and build
>> a call stack on the fly should probably work with the encoded
>> representation.
>>
>> ==== Discussion ====
>>
>> '''The cost of comparing a signature''': for the vast majority of
>> functions, the cost is comparing a 64-bit number stored in the CPU
>> instruction stream (needle) to a value that is guaranteed to
>> already be in cache (candidate match).
>>
>> '''Pros:'''
>>
>>  * Readability-wise, one can use the C switch statement to
>>    dispatch.
>>
>>  * "Stateless data": for compiled code it does not require any
>>    run-time initialization the way interning does.
>>
>>  * One less pointer dereference in the common case of a short
>>    signature.
>>
>> '''Cons:'''
>>
>>  * Long signatures will require more than 8 bytes to store and
>>    could thus be more expensive than interned strings.
>>
>>  * The format looks uglier in the form of literals in C source
>>    code.
>>
>> == Signature strings ==
>>
>> Example: the function
>> {{{
>> int f(double x, float y);
>> }}}
>> would have the signature string {{{"df)i"}}} (or, to save space,
>> {{{"idf"}}}).
>>
>> Fields would follow the PEP 3118 extensions of the struct-module
>> format string, but with some modifications:
>>
>>  * The format should be canonical and fit for {{{strcmp}}}-like
>>    comparison: no whitespace, no field names (TBD: what else?)
>
> I think alignment is also a troublemaker. Maybe we should allow '@'
> (which cannot appear in the signature string but will be the
> default, that is, native size, alignment and byte order) and '^',
> unaligned native size and byte order (to be used for packed
> structs).
>
>>  * TBD: Information about GIL requirements (nogil, with gil?), and
>>    how exceptions are reported.
>
> Maybe that could be a separate list, to be consulted mostly for
> explicit casts (I think PyErr_Occurred() would be the default for
> non-object return types).
>
>>  * TBD: Support for Cython-specific constructs like memoryview
>>    slices (so that arrays with strides and shape can be passed
>>    faster than by passing an {{{"O"}}}).
>
> Definitely; maybe something simple like M{1f}, for a 1D memoryview
> slice of floats.
It would certainly be useful to have special syntax for memory views
(after nailing down a well-defined ABI for them) and builtin types.
Being able to declare something as taking a
"sage.rings.integer.Integer" could also prove useful, but could result
in long (and prefix-sharing) signatures, favoring the run-time
allocated IDs.

_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel