On Mon, Sep 21, 2015 at 7:29 AM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote: > We have the PyArrayObject vs PyArrayObject_fields definition in > ndarraytypes.h that is used to enforce access to the members through inline > functions rather than directly, which seems to me like the right way to go: > don't leave stones unturned, hide everything and provide PyUFunc_NIN, > PyUFunc_NOUT and friends to handle those too.
The PyArrayObject vs PyArrayObject_fields distinction is only enabled if a downstream library explicitly requests it with #define NPY_NO_DEPRECATED_API, though -- the idea is that the changes in this NEP would be enabled unconditionally, even for old code. So the reason nin/nout/nargs are left exposed in this proposal is that there's some existing code out there that would break (until updated) if we hid them, and not much benefit to breaking it. If we're fine with breaking that code then we could just hide them unconditionally too. The only code I found in the wild that would be affected is the "rational" user-defined dtype, which would be trivially fixable since the only thing it does with ufunc->nargs is a quick consistency check: https://github.com/numpy/numpy-dtypes/blob/c0175a6b1c5aa89b4520b29487f06d0e200e2a03/npytypes/rational/rational.c#L1140-L1151 Also it's not 100% clear right now whether we even want to keep supporting the old user-defined dtype API that this particular code is based around. But if this code uses ufunc->nargs then perhaps other code does too? I'm open to opinions -- I doubt it matters that much either way. I just want to make sure that we can hide the other stuff :-). When it comes to evolving these APIs in general: one unfortunate thing about the PyArrayObject changes in 1.7 is that because they were implemented using *inline* functions (/macros) they haven't affected the a*B*i exposure at all, even in code that has upgraded to the new calling conventions. While user code no longer *names* the internal fields directly, we still have to implement exactly the same fields and put them in exactly the same place in memory or else break ABI. And the other unfortunate thing is that we don't really have a mechanism for saying "okay, we're dropping support for the old way of doing things in 1.xx" -- in particular the current NPY_NO_DEPRECATED_API mechanism doesn't give us any way to detect and error out if someone tries to use an old version of the APIs, so ABI breaks still mean segfaults. I'm thinking that if/when we figure out how to implement the "sliding window" API/ABI idea that we talked about at SciPy, then that will give us a strategy for cleanly transitioning to a world with a maintainable API+ABI and it becomes worth sitting down and making up a set of setters/getters for the attributes that we want to make public in a maintainable way. But until then our only real options are either hard breaks or nothing, so unless we want to do a hard break there's not much point talking about it. -n > On Sun, Sep 20, 2015 at 9:13 PM, Nathaniel Smith <n...@pobox.com> wrote: >> >> Hi all, >> >> Here's a first draft NEP for comments. >> >> -- >> >> Synopsis >> ======== >> >> Improving numpy's dtype system requires that ufunc loops start having >> access to details of the specific dtype instance they are acting on: >> e.g. an implementation of np.equal for strings needs access to the >> dtype object in order to know what "n" to pass to strncmp. Similar >> issues arise with variable length strings, missing values, categorical >> data, unit support, datetime with timezone support, etc. -- this is a >> major blocker for improving numpy. >> >> Unfortunately, the current ufunc inner loop function signature makes >> it very difficult to provide this information. We might be able to >> wedge it in there, but it'd be ugly. >> >> The other option would be to change the signature. What would happen >> if we did this? For most common uses of the C API/ABI, we could do >> this easily while maintaining backwards compatibility. But there are >> also some rarely-used parts of the API/ABI that would be >> prohibitively difficult to preserve. >> >> In addition, there are other potential changes to ufuncs on the >> horizon (e.g. extensions of gufuncs to allow them to be used more >> generally), and the current API exposure is so massive that any such >> changes will be difficult to make in a fully compatible way. This NEP >> thus considers the possibility of closing down the ufunc API to a >> minimal, maintainable subset of the current API. >> >> To better understand the consequences of this potential change, I >> performed an exhaustive analysis of all the code on Github, Bitbucket, >> and Fedora, among others. The results make me highly confident that of >> all the publically available projects in the world, the only ones >> which touch the problematic parts of the ufunc API are: Numba, >> dynd-python, and `gulinalg <https://github.com/ContinuumIO/gulinalg>`_ >> (with the latter's exposure being trivial). >> >> Given this, I propose that for 1.11 we: >> 1) go ahead and hide/disable the problematic parts of the ABI/API, >> 2) coordinate with the known affected projects to minimize disruption >> to their users (which is made easier since they are all projects that >> are almost exclusively distributed via conda, which enforces strict >> NumPy ABI versioning), >> 3) publicize these changes widely so as to give any private code that >> might be affected a chance to speak up or adapt, and >> 4) leave the "ABI version tag" as it is, so as not to force rebuilds >> of the vast majority of projects that will be unaffected by these >> changes. >> >> This NEP defers the question of exactly what the improved API should >> be, since there's no point in trying to nail down the details until >> we've decided whether it's even possible to change. >> >> >> Details >> ======= >> >> The problem >> ----------- >> >> Currently, a ufunc inner loop implementation is called via the >> following function prototype:: >> >> typedef void (*PyUFuncGenericFunction) >> (char **args, >> npy_intp *dimensions, >> npy_intp *strides, >> void *innerloopdata); >> >> Here ``args`` is an array of pointers to 1-d buffers of input/output >> data, ``dimensions`` is a pointer to the number of entries in these >> buffers, ``strides`` is an array of integers giving the strides for >> each input/output array, and ``innerloopdata`` is an arbitrary void* >> supplied by whoever registered the ufunc loop. (For gufuncs, extra >> shape and stride information about the core dimensions also gets >> packed into the ends of these arrays in a somewhat complicated way.) >> >> There are 4 key items that define a NumPy array: data, shape, strides, >> dtype. Notice that this function only gets access to 3 of them. Our >> goal is to fix that. For example, a better signature would be:: >> >> typedef void (*PyUFuncGenericFunction_NEW) >> (char **data, >> npy_intp *shapes, >> npy_intp *strides, >> PyArray_Descr *dtypes, /* NEW */ >> void *innerloopdata); >> >> (In practice I suspect we might want to make some more changes as >> well, like upgrading gufunc core shape/strides to proper arguments >> instead of tacking it onto the existing arrays, and adding an "escape >> valve" void* reserved for future extensions. But working out such >> details is outside the scope of this NEP; the above will do for >> illustration.) >> >> The goal of this NEP is to clear the ground so that we can start >> supporting ufunc inner loops that take dtype arguments, and make other >> enhancements to ufunc functionality going forward. >> >> >> Proposal >> -------- >> >> Currently, the public API/ABI for ufuncs consists of the functions:: >> >> PyUFunc_GenericFunction >> >> PyUFunc_FromFuncAndData >> PyUFunc_FromFuncAndDataAndSignature >> PyUFunc_RegisterLoopForDescr >> PyUFunc_RegisterLoopForType >> >> PyUFunc_ReplaceLoopBySignature >> PyUFunc_SetUsesArraysAsData >> >> together with direct access to PyUFuncObject's internal fields:: >> >> typedef struct { >> PyObject_HEAD >> int nin, nout, nargs; >> int identity; >> PyUFuncGenericFunction *functions; >> void **data; >> int ntypes; >> int check_return; >> const char *name; >> char *types; >> const char *doc; >> void *ptr; >> PyObject *obj; >> PyObject *userloops; >> int core_enabled; >> int core_num_dim_ix; >> int *core_num_dims; >> int *core_dim_ixs; >> int *core_offsets; >> char *core_signature; >> PyUFunc_TypeResolutionFunc *type_resolver; >> PyUFunc_LegacyInnerLoopSelectionFunc *legacy_inner_loop_selector; >> PyUFunc_InnerLoopSelectionFunc *inner_loop_selector; >> PyUFunc_MaskedInnerLoopSelectionFunc *masked_inner_loop_selector; >> npy_uint32 *op_flags; >> npy_uint32 iter_flags; >> } PyUFuncObject; >> >> Obviously almost any future changes to how ufuncs work internally will >> involve touching some part of this public API/ABI. >> >> Concretely, the proposal here is that we avoid this by disabling the >> following functions (i.e., any attempt to call them should simply >> raise a ``NotImplementedError``):: >> >> PyUFunc_ReplaceLoopBySignature >> PyUFunc_SetUsesArraysAsData >> >> and that we reduce the publicly visible portion of PyUFuncObject down to:: >> >> typedef struct { >> PyObject_HEAD >> int nin, nout, nargs; >> } PyUFuncObject; >> >> >> Data on current API/ABI usage >> ----------------------------- >> >> In order to assess how much code would be affected by this proposal, I >> used a combination of Github search and Searchcode.com to trawl >> through the majority of all publicly available open source code. >> Neither search tool provides a fine-grained enough query language to >> directly tell us what we want to know, so I instead followed the >> strategy of first, casting a wide net: picking a set of search terms >> that are likely to catch all possibly-broken code (together with many >> false positives), and second, using automated tools to sift out the >> false positives and see what remained. Altogether, I reviewed 4464 >> search results. >> >> The tool I wrote to do this is `available on github >> <https://github.com/njsmith/codetrawl>`_, and so is `the analysis code >> itself <https://github.com/njsmith/ufunc-abi-analysis>`_. >> >> >> Uses of PyUFuncObject internals >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >> >> There are no functions in the public API which return >> ``PyUFuncObject*`` values directly, so any code that access >> PyUFuncObject fields will have to mention that token in the course of >> defining a variable, performing a cast, setting up a typedef, etc. >> >> Therefore, I searched Github for all files written in C, C++, >> Objective C, Python, or Cython, which mentioned either "PyUFuncObject >> AND types" or "PyUFuncObject AND NOT types". (This is to work around >> limitations on how many results Github search is willing to return to >> a single query.) In addition, I searched for ``PyUFuncObject`` on >> searchcode.com. >> >> The full report on these searches is available here: >> >> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/pyufuncobject-report.html >> >> The following were screened out as non-problems: >> >> - Copies of NumPy itself (an astonishing number of people have checked >> in copies of it to their own source tree) >> - NumPy forks / precursors / etc. (e.g. Numeric also had a type called >> PyUFuncObject, the "bohrium" project has a fork of numpy 1.6, etc.) >> - Cython-generated boilerplate used to generate the "object has >> changed size" warning (which we `unconditionally filter out anyway >> <https://github.com/numpy/numpy/blob/master/numpy/__init__.py#L226>`_) >> - Lots of calls to ``PyUFunc_RegisterLoopForType`` and friends, which >> require casting the first argument to ``PyUFuncObject*`` >> - Misc. other unproblematic stuff (like Cython header declarations >> that never get used) >> >> There were also several cases that actually referenced PyUFuncObject >> internal fields: >> >> - The "rational" dtype from numpy-dtypes, which is used in a few >> projects, accesses ``ufunc->nargs`` as a safety check, but does not >> touch any other fields (`see here >> >> <https://github.com/numpy/numpy-dtypes/blob/c0175a6b1c5aa89b4520b29487f06d0e200e2a03/npytypes/rational/rational.c#L1140-L1151>`_). >> >> - Numba: does some rather elaborate things to support the definition >> of on-the-fly JITted ufuncs. These seem to be clear deficiencies in >> the ufunc API (e.g., there's no good way to control the lifespan of >> the array of function pointers passed to ``PyUFunc_FromFuncAndData``), >> so we should work with them to provide the API they need to do this in >> a maintainable way. Some of the relevant code: >> >> https://github.com/numba/numba/tree/master/numba/npyufunc >> >> https://github.com/numba/numba/blob/98752647a95ac6c9d480e81ca5c8afcfa3ddfd18/numba/npyufunc/_internal.c >> >> - dynd-python: Contains some code that attempts to extract the inner >> loops from a numpy ufunc object and wrap them into the dynd 'ufunc' >> equivalent: >> >> https://github.com/libdynd/dynd-python/blob/c06f8fc4e72257abac589faf76f10df8c045159b/dynd/src/numpy_ufunc_kernel.cpp >> >> - gulinalg: I'm not sure if anyone is still using this code since most >> of it was merged into numpy itself, but it's not a big deal in any >> case: all it contains is a `debugging function >> >> <https://github.com/ContinuumIO/gulinalg/blob/2ef365c48427c026dab4f45dc6f8b1b9af184460/gulinalg/src/gulinalg.c.src#L527-L550>`_ >> that dumps some internal fields from the PyUFuncObject. If you look, >> though, all calls to this function are already commented out :-). >> >> The full report is available here: >> >> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/pyufuncobject-report.html >> >> In the course of this analysis, it was also noted that the standard >> Cython pxd files contain a wrapper for ufunc objects:: >> >> cdef class ufunc [object PyUFuncObject]: >> ... >> >> which means that Cython code can access internal struct fields via an >> object of type ``ufunc``, and thus escape our string-based search >> above. Therefore I also examined all Cython files on Github or >> searchcode.com that matched the query ``ufunc``, and searched for any >> lines matching any of the following regular expressions:: >> >> cdef\W+ufunc >> catches: 'cdef ufunc fn' >> cdef\W+.*\.\W*ufunc >> catches: 'cdef np.ufunc fn' >> <.*ufunc\W*> >> catches: '(<ufunc> fn).nargs', '(< np.ufunc > fn).nargs' >> cdef.*\(.*ufunc >> catches: 'cdef doit(np.ufunc fn, ...):' >> >> (I considered parsing the actual source and analysing it that way, but >> decided I was too lazy. This could be done if anyone is worried that >> the above regexes might miss things though.) >> >> There were zero files that contained matches for any of the above regexes: >> >> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/ufunc-cython-report.html >> >> >> Uses of PyUFunc_ReplaceLoopBySignature >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >> >> Applying the same screening as above, the only code that was found >> that used this function is also in Numba: >> >> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/PyUFunc_ReplaceLoopBySignature-report.html >> >> >> Uses of PyUFunc_SetUsesArraysAsData >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >> >> Aside from being semi-broken since 1.7 (it never got implemented for >> "masked" ufunc loops, i.e. those that use where=), there appear to be >> zero uses of this functionality either inside or outside NumPy: >> >> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/PyUFunc_SetUsesArraysAsData-report.html >> >> >> Rationale >> --------- >> >> **Rationale for preserving the remaining API functions**:: >> >> PyUFunc_GenericFunction >> >> PyUFunc_FromFuncAndData >> PyUFunc_FromFuncAndDataAndSignature >> PyUFunc_RegisterLoopForDescr >> PyUFunc_RegisterLoopForType >> >> In addition to being widely used, these functions can easily be >> preserved even if we change how ufuncs work internally, because they >> only ingest loop function pointers, they never return them. So they >> can easily be modified to wrap whatever loop function(s) they receive >> inside an adapter function that calls them at the appropriate time, >> and then register that adapter function using whatever API we add in >> the future. >> >> **Rationale for preserving the particular fields that are preserved**: >> Preserving ``nargs`` lets us avoid a little bit of breakage with the >> random dtype, and it doesn't seem like preserving ``nin``, ``nout``, >> ``nargs`` fields will produce any undue burden on future changes to >> ufunc internals; even if we were to introduce variadic ufuncs we could >> always just stick a -1 in the appropriate fields or whatever. >> >> **Rationale for removing PyUFunc_ReplaceLoopBySignature**: this >> function *returns* the PyUFunc_GenericFunction that was replaced; if >> we stop representing all loops using the legacy >> PyUFunc_GenericFunction type, then this will not be possible to do in >> the future. >> >> **Rationale for removing PyUFunc_SetUsesArraysAsData**: If set as the >> ``innerloopdata`` on a ufunc loop, then this function acts as a >> sentinel value, and causes the ``innerloopdata`` to instead be set to >> a pointer to the passed-in PyArrayObjects. In principle we could >> preserve this function, but it has a number of deficiencies: >> - No-one appears to use it. >> - It's been buggy for several releases and no-one noticed. >> - AFAIK the only reason it was included in the first place is that it >> provides a backdoor for ufunc loops to get access to the dtypes -- but >> we are planning to fix this in a better way. >> - It can't be shimmed as easily as the loop registration functions, >> because we don't anticipate that the new-and-improved ufunc loop >> functions will *get* access to the array objects, only to the dtypes; >> so this would have to remain cluttering up the core dispatch path >> indefinitely. >> - We have good reason for *not* wanting to get ufunc loops get access >> to the actual array objects, because one of the goals on our roadmap >> is exactly to enable the use of ufuncs on non-ndarray objects. Giving >> ufuncs access to dtypes alone creates a clean boundary here: it >> guarantees that ufunc loops can work equally on all duck-array objects >> (so long as they have a dtype), and enforces the invariant that >> anything which affects the interpretation of data values should be >> attached to the dtype, not to the array object. >> >> >> Rejected alternatives >> --------------------- >> >> **Do nothing**: there's no way we'll ever be able to touch ufuncs at >> all if we don't hide those fields sooner or later. While any amount of >> breakage is regrettable, the costs of cleaning things up now are less >> than the costs of never improving numpy's APIs. >> >> **Somehow sneak the dtype information in via ``void >> *innerloopdata``**: This might let us preserve the signature of >> PyUFunc_GenericFunction, and thus preserve >> PyUFunc_ReplaceLoopBySignature. But we'd still have the problem of >> leaving way too much internal state exposed, and it's not even clear >> how this would work, given that we actually do want to preserve the >> use of ``innerloopdata`` for actual per-loop data. (This is where the >> PyUFunc_SetUsesArraysAsData hack fails.) >> >> >> -- >> Nathaniel J. Smith -- http://vorpus.org >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion@scipy.org >> https://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > > -- > (\__/) > ( O.o) > ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de > dominación mundial. > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > -- Nathaniel J. Smith -- http://vorpus.org _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion