Re: [Cython] CEP1000: Native dispatch through callables
On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
> On Thu, Apr 12, 2012 at 3:34 PM, Dag Sverre Seljebotn wrote:
>> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
>>> Travis Oliphant recently raised the issue on the NumPy list of what mechanisms to use to box native functions produced by his Numba so that SciPy functions can call it, e.g. (I'm making the numba part up):
>>>
>>>     @numba  # Compiles function using LLVM
>>>     def f(x):
>>>         return 3 * x
>>>
>>>     print scipy.integrate.quad(f, 1, 2)  # do many callbacks natively!
>>>
>>> Obviously, we want something standard, so that Cython functions can also be called in a fast way. This is very similar to CEP 523 (http://wiki.cython.org/enhancements/nativecall), but rather than Cython-to-Cython, we want something that SciPy, NumPy, numba, Cython, f2py and fwrap can all implement. Here's my proposal; Travis seems happy to implement something like it for numba and parts of SciPy:
>>>
>>> http://wiki.cython.org/enhancements/nativecall
>>
>> I'm sorry. HERE is the CEP:
>>
>> http://wiki.cython.org/enhancements/cep1000
>>
>> Since writing that yesterday, I've moved more in the direction of wanting a zero-terminated list of overloads instead of providing a count, having the fast protocol jump over the header (since the version is available elsewhere), and just demanding that the structure is sizeof(void*)-aligned in the first place rather than the complicated padding.
>
> Great idea to coordinate with the many other projects here. Eventually this could maybe even be a PEP.
>
> Somewhat related, I'd like to add support for Go-style interfaces. These would essentially be vtables of pre-fetched function pointers, and could play very nicely with this interface.

Yep; but you agree that this can be done in isolation without considering vtables first?

> Have you given any thought as to what happens if __call__ is re-assigned for an object (or subclass of an object) supporting this interface? Or is this out of scope?

Out-of-scope, I'd say. Though you can always write an object that detects if you assign to __call__...
> Minor nit: I don't think should_dereference is worth branching on, if one wants to save the allocation one can still use a variable-sized type and point to oneself. Yes, that's an extra dereference, but the memory is already likely close and it greatly simplifies the logic. But I could be wrong here.

Those minor nits are exactly what I seek; since Travis will have the first implementation in numba<->SciPy, I just want to make sure that what he does will work efficiently with Cython.

Can we perhaps just require that the information is embedded in the object? I must admit that when I wrote that I was mostly thinking of JIT-style code generation, where you only use should_dereference for code generation. But yes, by converting the table to a C structure you can do without a JIT.

> Also, I'm not sure the type registration will scale, especially if every callable type wanted to get registered. (E.g. currently closures and generators are new types...) Where to draw the line? (Perhaps things could get registered lazily on the first __nativecall__ lookup, as they're likely to be looked up again?)

Right... if we do some work to synchronize the types for Cython modules generated by the same version of Cython, we're left with 3-4 types for Cython, right? Then a couple for numba and one for f2py; so on the order of 10?

An alternative is to do something funny in the type object to get across the offset-in-object information (abusing the docstring, or introducing our own flag which means that the type object has an additional non-standard field at the end).

Dag

___
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel
Re: [Cython] CEP1000: Native dispatch through callables
On 04/13/2012 07:24 AM, Stefan Behnel wrote:
> Dag Sverre Seljebotn, 13.04.2012 00:34:
>> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
>>> Travis Oliphant recently raised the issue on the NumPy list of what mechanisms to use to box native functions produced by his Numba so that SciPy functions can call it, e.g. (I'm making the numba part up):
>>>
>>>     @numba  # Compiles function using LLVM
>>>     def f(x):
>>>         return 3 * x
>>>
>>>     print scipy.integrate.quad(f, 1, 2)  # do many callbacks natively!
>>>
>>> Obviously, we want something standard, so that Cython functions can also be called in a fast way. This is very similar to CEP 523 (http://wiki.cython.org/enhancements/nativecall), but rather than Cython-to-Cython, we want something that SciPy, NumPy, numba, Cython, f2py and fwrap can all implement. Here's my proposal; Travis seems happy to implement something like it for numba and parts of SciPy:
>>>
>>> http://wiki.cython.org/enhancements/nativecall
>>
>> I'm sorry. HERE is the CEP:
>>
>> http://wiki.cython.org/enhancements/cep1000
>
> Some general remarks: I'm all for doing something in this direction and have been hinting at it on the PyPy mailing list for a while, without reaction so far. I'll trigger them again, with a pointer to this discussion and the CEP. PyPy should be totally interested in a generic way to do fast calls into wrapped C code in general and Cython-implemented functions specifically. Their JIT would then look at the function at runtime and unwrap it.
>
> There's PEP 362 which proposes a Signature object. It seems to have attracted some interest lately and Guido seems to like it also. I think we should come up with a way to add a C level interface to that, instead of designing something entirely separate.
> http://www.python.org/dev/peps/pep-0362/

Well, provided that you still want an efficient representation that can be strcmp-ed in dispatch code, this seems to boil down to using a Signature object rather than a capsule (with a C interface), storing it in __signature__ rather than __fastcall__, and perhaps providing a slot in the type object for a function returning it.

I really think the right approach is to prove the concept outside of the standardization process first: a) by the time a PEP would be accepted, it will have been years since Travis had time to work on this; b) as far as the slot in the type object goes, we still have users on Python 2.4 today; a Python 3.4+ solution is not really a solution.

Dag
[Cython] PyPy sprint in Leipzig, June 22-27 (was: Re: CEP1000: Native dispatch through callables)
Stefan Behnel, 13.04.2012 07:24:
> Dag Sverre Seljebotn, 13.04.2012 00:34:
>> http://wiki.cython.org/enhancements/cep1000
>
> I'm all for doing something in this direction and have been hinting at it on the PyPy mailing list for a while, without reaction so far. I'll trigger them again, with a pointer to this discussion and the CEP. PyPy should be totally interested in a generic way to do fast calls into wrapped C code in general and Cython implemented functions specifically. Their JIT would then look at the function at runtime and unwrap it.

BTW, there will be a PyPy sprint in Leipzig from June 22-27. If anyone's interested in coordinating with PyPy on this and other topics, that might be a good place to go for a day or two.

http://permalink.gmane.org/gmane.comp.python.pypy/9896

Stefan
Re: [Cython] CEP1000: Native dispatch through callables
On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote:
> On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
>> On Thu, Apr 12, 2012 at 3:34 PM, Dag Sverre Seljebotn wrote:
>>> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
>>>> Travis Oliphant recently raised the issue on the NumPy list of what mechanisms to use to box native functions produced by his Numba so that SciPy functions can call it, e.g. (I'm making the numba part up):
>>>>
>>>>     @numba  # Compiles function using LLVM
>>>>     def f(x):
>>>>         return 3 * x
>>>>
>>>>     print scipy.integrate.quad(f, 1, 2)  # do many callbacks natively!
>>>>
>>>> Obviously, we want something standard, so that Cython functions can also be called in a fast way. This is very similar to CEP 523 (http://wiki.cython.org/enhancements/nativecall), but rather than Cython-to-Cython, we want something that SciPy, NumPy, numba, Cython, f2py and fwrap can all implement. Here's my proposal; Travis seems happy to implement something like it for numba and parts of SciPy:
>>>>
>>>> http://wiki.cython.org/enhancements/nativecall
>>>
>>> I'm sorry. HERE is the CEP:
>>>
>>> http://wiki.cython.org/enhancements/cep1000
>>>
>>> Since writing that yesterday, I've moved more in the direction of wanting a zero-terminated list of overloads instead of providing a count, and have the fast protocol jump over the header (since the version is available elsewhere), and just demand that the structure is sizeof(void*)-aligned in the first place rather than the complicated padding.
>>
>> Great idea to coordinate with the many other projects here. Eventually this could maybe even be a PEP.
>>
>> Somewhat related, I'd like to add support for Go-style interfaces. These would essentially be vtables of pre-fetched function pointers, and could play very nicely with this interface.
>
> Yep; but you agree that this can be done in isolation without considering vtables first?

Yes, for sure.
>> Have you given any thought as to what happens if __call__ is re-assigned for an object (or subclass of an object) supporting this interface? Or is this out of scope?
>
> Out-of-scope, I'd say. Though you can always write an object that detects if you assign to __call__...
>
>> Minor nit: I don't think should_dereference is worth branching on, if one wants to save the allocation one can still use a variable-sized type and point to oneself. Yes, that's an extra dereference, but the memory is already likely close and it greatly simplifies the logic. But I could be wrong here.
>
> Those minor nits are exactly what I seek; since Travis will have the first implementation in numba<->SciPy, I just want to make sure that what he does will work efficiently with Cython.

+1

I have to admit building/invoking these var-arg-sized __nativecall__ records seems painful. Here's another suggestion:

    struct {
        void* pointer;
        size_t signature;      // compressed binary representation, 95% coverage
        char* long_signature;  // used if signature is not representable in
                               // a size_t, as indicated by signature = 0
    } record;

These char* could optionally be allocated at the end of the record* for optimal locality. We could even dispense with the binary signature, but having that option allows us to avoid strcmp for stuff like d)d and ffi)f.

> Can we perhaps just require that the information is embedded in the object?

I think not, this would require variably-sized objects (and also use up the variable-sized nature). Given that this is in a portion of the program that is iterating over a Python tuple, I think the extra dereference here is inconsequential.

> I must admit that when I wrote that I was mostly thinking of JIT-style code generation, where you only use should_dereference for code generation. But yes, by converting the table to a C structure you can do without a JIT.
>> Also, I'm not sure the type registration will scale, especially if every callable type wanted to get registered. (E.g. currently closures and generators are new types...) Where to draw the line? (Perhaps things could get registered lazily on the first __nativecall__ lookup, as they're likely to be looked up again?)
>
> Right... if we do some work to synchronize the types for Cython modules generated by the same version of Cython, we're left with 3-4 types for Cython, right? Then a couple for numba and one for f2py; so on the order of 10?

No, I think each closure is its own type.

> An alternative is to do something funny in the type object to get across the offset-in-object information (abusing the docstring, or introducing our own flag which means that the type object has an additional non-standard field at the end).

It's a hack, but the flag + non-standard field idea might just work... Ah, don't you just love C :)

- Robert
Re: [Cython] CEP1000: Native dispatch through callables
Dag Sverre Seljebotn, 13.04.2012 11:13:
> On 04/13/2012 07:24 AM, Stefan Behnel wrote:
>> Dag Sverre Seljebotn, 13.04.2012 00:34:
>>> http://wiki.cython.org/enhancements/cep1000
>>
>> There's PEP 362 which proposes a Signature object. It seems to have attracted some interest lately and Guido seems to like it also. I think we should come up with a way to add a C level interface to that, instead of designing something entirely separate.
>>
>> http://www.python.org/dev/peps/pep-0362/
>
> Well, provided that you still want an efficient representation that can be strcmp-ed in dispatch codes, this seems to boil down to using a Signature object rather than a capsule (with a C interface), and store it in __signature__ rather than __fastcall__, and perhaps provide a slot in the type object for a function returning it.

Basically, yes. I was just bringing it up because we should keep it in mind when designing a solution. Moving it into the Signature object would also allow C signature introspection from Python code, for example. It would obviously need a straight C level way to access it. I'm not sure it has to be a function, though. I would prefer a simple array of structs that map signature strings to function pointers. Like the PyMethodDef struct.

> I really think the right approach is to prove the concept outside of the standardization process first; a) by the time a PEP would be accepted it will have been years since Travis had time to work on this, b) as far as the slot in the type object goes, we're left with users on Python 2.4 today; a Python 3.4+ solution is not really a solution.

Sure. But nothing keeps us from backporting at least parts of it to older Pythons, like we did for so many other things.

Stefan
Re: [Cython] CEP1000: Native dispatch through callables
Robert Bradshaw, 13.04.2012 12:17:
> On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote:
>> On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
>>> Have you given any thought as to what happens if __call__ is re-assigned for an object (or subclass of an object) supporting this interface? Or is this out of scope?
>>
>> Out-of-scope, I'd say. Though you can always write an object that detects if you assign to __call__...

+1 for out of scope. This is a pure C level feature.

>>> Minor nit: I don't think should_dereference is worth branching on, if one wants to save the allocation one can still use a variable-sized type and point to oneself. Yes, that's an extra dereference, but the memory is already likely close and it greatly simplifies the logic. But I could be wrong here.
>>
>> Those minor nits are exactly what I seek; since Travis will have the first implementation in numba<->SciPy, I just want to make sure that what he does will work efficiently with Cython.
>
> +1
>
> I have to admit building/invoking these var-arg-sized __nativecall__ records seems painful. Here's another suggestion:
>
>     struct {
>         void* pointer;
>         size_t signature;      // compressed binary representation, 95% coverage
>         char* long_signature;  // used if signature is not representable in
>                                // a size_t, as indicated by signature = 0
>     } record;
>
> These char* could optionally be allocated at the end of the record* for optimal locality. We could even dispense with the binary signature, but having that option allows us to avoid strcmp for stuff like d)d and ffi)f.

Assuming we use literals and a const char* for the signature, the C compiler would cut down the number of signature strings automatically for us. And a pointer comparison is the same as a size_t comparison.

That would only apply at a per-module level, though, so it would require an indirection for the signature IDs. But it would avoid a global registry.
Another idea would be to set the signature ID field to 0 at the beginning and call a C-API function to let the current runtime assign an ID > 0, unique for the currently running application. Then every user would only have to parse the signature once to adapt to the respective ID and could otherwise branch based on it directly.

For Cython, we could generate a static ID variable for each typed call that we found in the sources. When encountering a C signature on a callable, either a) the ID variable is still empty (the initial case); then we parse the signature to see if it matches the expected signature, and if it does, we assign the corresponding ID to the static ID variable and issue a direct call. Or b) the ID field is already set (the normal case); then we compare the signature IDs directly and issue a C call if they match. If the IDs do not match, we issue a normal Python call.

>> Right... if we do some work to synchronize the types for Cython modules generated by the same version of Cython, we're left with 3-4 types for Cython, right? Then a couple for numba and one for f2py; so on the order of 10?
>
> No, I think each closure is its own type.

And that even applies to fused functions, right? They'd have one closure for each type combination.

>> An alternative is to do something funny in the type object to get across the offset-in-object information (abusing the docstring, or introduce our own flag which means that the type object has an additional non-standard field at the end).
>
> It's a hack, but the flag + non-standard field idea might just work...

Plus, it wouldn't have to stay a non-standard field. If it's accepted into CPython 3.4, we could safely use it in all existing versions of CPython.

Stefan
Re: [Cython] CEP1000: Native dispatch through callables
On 04/13/2012 01:38 PM, Stefan Behnel wrote:
> Robert Bradshaw, 13.04.2012 12:17:
>> On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote:
>>> On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
>>>> Have you given any thought as to what happens if __call__ is re-assigned for an object (or subclass of an object) supporting this interface? Or is this out of scope?
>>>
>>> Out-of-scope, I'd say. Though you can always write an object that detects if you assign to __call__...
>
> +1 for out of scope. This is a pure C level feature.
>
>>>> Minor nit: I don't think should_dereference is worth branching on, if one wants to save the allocation one can still use a variable-sized type and point to oneself. Yes, that's an extra dereference, but the memory is already likely close and it greatly simplifies the logic. But I could be wrong here.
>>>
>>> Those minor nits are exactly what I seek; since Travis will have the first implementation in numba<->SciPy, I just want to make sure that what he does will work efficiently with Cython.
>>
>> +1
>>
>> I have to admit building/invoking these var-arg-sized __nativecall__ records seems painful. Here's another suggestion:
>>
>>     struct {
>>         void* pointer;
>>         size_t signature;  // compressed binary representation, 95% coverage

Once you start passing around functions that take memory view slices as arguments, that 95% estimate will be off, I think.

>>         char* long_signature;  // used if signature is not representable in
>>                                // a size_t, as indicated by signature = 0
>>     } record;
>>
>> These char* could optionally be allocated at the end of the record* for optimal locality. We could even dispense with the binary signature, but having that option allows us to avoid strcmp for stuff like d)d and ffi)f.
>
> Assuming we use literals and a const char* for the signature, the C compiler would cut down the number of signature strings automatically for us. And a pointer comparison is the same as a size_t comparison.

I'll go one further: intern Python bytes objects.
It's just a PyObject*, but it's *required* (or just strongly encouraged) to have gone through

    sig = sys.modules['_nativecall']['interned_db'].setdefault(sig, sig)

Obviously in a PEP you'd have a C-API function for such interning (a completely standalone utility). Performance of the interning operation itself doesn't matter...

Unless CPython has interning features itself, like in Java? Was that present back in the day and then ripped out?

Requiring interning is somewhat less elegant in one way, but it makes a lot of other stuff much simpler. That gives us

    struct {
        void *pointer;
        PyBytesObject *signature;
    } record;

and then you allocate a NULL-terminated array of these for all the overloads.

> That would only apply at a per-module level, though, so it would require an indirection for the signature IDs. But it would avoid a global registry.
>
> Another idea would be to set the signature ID field to 0 at the beginning and call a C-API function to let the current runtime assign an ID > 0, unique for the currently running application. Then every user would only have to parse the signature once to adapt to the respective ID and could otherwise branch based on it directly.
>
> For Cython, we could generate a static ID variable for each typed call that we found in the sources. When encountering a C signature on a callable, either a) the ID variable is still empty (initial case), then we parse the signature to see if it matches the expected signature. If it does, we assign the corresponding ID to the static ID variable and issue a direct call. If b) the ID field is already set (normal case), we compare the signature IDs directly and issue a C call if they match. If the IDs do not match, we issue a normal Python call.
>
>>> Right... if we do some work to synchronize the types for Cython modules generated by the same version of Cython, we're left with 3-4 types for Cython, right? Then a couple for numba and one for f2py; so on the order of 10?
>>
>> No, I think each closure is its own type.
> And that even applies to fused functions, right? They'd have one closure for each type combination.
>
>>> An alternative is to do something funny in the type object to get across the offset-in-object information (abusing the docstring, or introduce our own flag which means that the type object has an additional non-standard field at the end).
>>
>> It's a hack, but the flag + non-standard field idea might just work...
>
> Plus, it wouldn't have to stay a non-standard field. If it's accepted into CPython 3.4, we could safely use it in all existing versions of CPython.

Sounds good. Perhaps just find a single "extended" flag, then add a new flag field in our payload, in case we need to extend the type object yet again later and run out of unused flag bits (TBD: figure out how many unused flag bits there are).

Dag
Re: [Cython] CEP1000: Native dispatch through callables
On Fri, Apr 13, 2012 at 9:52 AM, Dag Sverre Seljebotn wrote:
> On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
>> Also, I'm not sure the type registration will scale, especially if every callable type wanted to get registered. (E.g. currently closures and generators are new types...) Where to draw the line? (Perhaps things could get registered lazily on the first __nativecall__ lookup, as they're likely to be looked up again?)
>
> Right... if we do some work to synchronize the types for Cython modules generated by the same version of Cython, we're left with 3-4 types for Cython, right? Then a couple for numba and one for f2py; so on the order of 10?
>
> An alternative is to do something funny in the type object to get across the offset-in-object information (abusing the docstring, or introduce our own flag which means that the type object has an additional non-standard field at the end).

In Python 2.7, it looks like there may be a few TP_FLAG bits free -- 15 and 16 are labeled "reserved for stackless python", and 2, 11, 22 don't have anything defined.

There may also be an unused ssize_t field ob_size at the beginning of the type object -- for some reason PyTypeObject is declared as variable size (using PyObject_VAR_HEAD), but I don't see any variable-size fields in it; the docs claim that the ob_size field is a "historical artifact that is maintained for binary compatibility... Always set this field to zero", and Include/object.h has a definition for a PyHeapTypeObject which has a PyTypeObject as its first member, which would not work if PyTypeObject had variable size. Grep says that the only place where ob_type->ob_size is accessed is in Objects/typeobject.c:object_sizeof(), which at first glance appears to be a bug, and anyway I don't think anyone cares whether __sizeof__ on C-callable objects is exactly correct. One could use this for an offset, or even a pointer.

One could also add a field easily by just subclassing PyTypeObject.
The Signature thing seems like a distraction to me. Signature is intended as just a nice convenient format for looking up stuff that's otherwise stored in more obscure ways -- the API equivalent of pretty-printing. The important thing here is getting the C-level dispatch right.

-- Nathaniel
Re: [Cython] CEP1000: Native dispatch through callables
On Fri, Apr 13, 2012 at 12:59 PM, Dag Sverre Seljebotn wrote:
> I'll go one further: intern Python bytes objects. It's just a PyObject*, but it's *required* (or just strongly encouraged) to have gone through
>
>     sig = sys.modules['_nativecall']['interned_db'].setdefault(sig, sig)
>
> Obviously in a PEP you'd have a C-API function for such interning (completely standalone utility). Performance of interning operation itself doesn't matter...
>
> Unless CPython has interning features itself, like in Java? Was that present back in the day and then ripped out?

http://docs.python.org/library/functions.html#intern ?

(C API: PyString_InternInPlace, moved from __builtin__.intern to sys.intern in Py3.)

- N
Re: [Cython] CEP1000: Native dispatch through callables
Dag Sverre Seljebotn, 13.04.2012 13:59:
> On 04/13/2012 01:38 PM, Stefan Behnel wrote:
>> Robert Bradshaw, 13.04.2012 12:17:
>>> On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote:
>>>> On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
>>>>> Minor nit: I don't think should_dereference is worth branching on, if one wants to save the allocation one can still use a variable-sized type and point to oneself. Yes, that's an extra dereference, but the memory is already likely close and it greatly simplifies the logic. But I could be wrong here.
>>>>
>>>> Those minor nits are exactly what I seek; since Travis will have the first implementation in numba<->SciPy, I just want to make sure that what he does will work efficiently with Cython.
>>>
>>> I have to admit building/invoking these var-arg-sized __nativecall__ records seems painful. Here's another suggestion:
>>>
>>>     struct {
>>>         void* pointer;
>>>         size_t signature;  // compressed binary representation, 95% coverage
>
> Once you start passing around functions that take memory view slices as arguments, that 95% estimate will be off I think.

Yes, I really think it makes sense to keep IDs unique only over the runtime of the application. (Note that using ssize_t instead of size_t would allow setting the ID to -1 to disable signature matching, in case that's ever needed.)

>>>         char* long_signature;  // used if signature is not representable in
>>>                                // a size_t, as indicated by signature = 0
>>>     } record;
>>>
>>> These char* could optionally be allocated at the end of the record* for optimal locality. We could even dispense with the binary signature, but having that option allows us to avoid strcmp for stuff like d)d and ffi)f.
>>
>> Assuming we use literals and a const char* for the signature, the C compiler would cut down the number of signature strings automatically for us. And a pointer comparison is the same as a size_t comparison.
>
> I'll go one further: intern Python bytes objects.
> It's just a PyObject*, but it's *required* (or just strongly encouraged) to have gone through
>
>     sig = sys.modules['_nativecall']['interned_db'].setdefault(sig, sig)
>
> Obviously in a PEP you'd have a C-API function for such interning (completely standalone utility). Performance of interning operation itself doesn't matter...
>
> Unless CPython has interning features itself, like in Java? Was that present back in the day and then ripped out?

AFAIR, it always had to be done explicitly and is only available for unicode objects in Py3 (and only for bytes objects in Py2). The CPython parser also does it for identifiers, but it's not done automatically for anything else. It's also not cheap to do - it would require a weakref dict to accommodate the temporary allocation of large strings, and weak references have a certain overhead. In any case, this is an entirely different use case that should be handled differently from normal string interning.

> Requiring interning is somewhat less elegant in one way, but it makes a lot of other stuff much simpler.
>
> That gives us
>
>     struct {
>         void *pointer;
>         PyBytesObject *signature;
>     } record;
>
> and then you allocate a NULL-terminated array of these for all the overloads.

However, the problem is the setup. These references will have to be created at init time and discarded at runtime termination. Not a problem for Cython generated code, but some overhead for hand-written code.

Since the size of these structs is not a problem, I'd prefer keeping Python objects out of the game and using an ssize_t ID instead, inferred from a char* signature at module init time by calling a C-API function. That avoids the need for any cleanup.

Stefan
Re: [Cython] CEP1000: Native dispatch through callables
On 13 April 2012 12:38, Stefan Behnel wrote:
> Robert Bradshaw, 13.04.2012 12:17:
>> On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote:
>>> On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
>>>> Have you given any thought as to what happens if __call__ is re-assigned for an object (or subclass of an object) supporting this interface? Or is this out of scope?
>>>
>>> Out-of-scope, I'd say. Though you can always write an object that detects if you assign to __call__...
>
> +1 for out of scope. This is a pure C level feature.
>
>>>> Minor nit: I don't think should_dereference is worth branching on, if one wants to save the allocation one can still use a variable-sized type and point to oneself. Yes, that's an extra dereference, but the memory is already likely close and it greatly simplifies the logic. But I could be wrong here.
>>>
>>> Those minor nits are exactly what I seek; since Travis will have the first implementation in numba<->SciPy, I just want to make sure that what he does will work efficiently with Cython.
>>
>> +1
>>
>> I have to admit building/invoking these var-arg-sized __nativecall__ records seems painful. Here's another suggestion:
>>
>>     struct {
>>         void* pointer;
>>         size_t signature;      // compressed binary representation, 95% coverage
>>         char* long_signature;  // used if signature is not representable in
>>                                // a size_t, as indicated by signature = 0
>>     } record;
>>
>> These char* could optionally be allocated at the end of the record* for optimal locality. We could even dispense with the binary signature, but having that option allows us to avoid strcmp for stuff like d)d and ffi)f.
>
> Assuming we use literals and a const char* for the signature, the C compiler would cut down the number of signature strings automatically for us. And a pointer comparison is the same as a size_t comparison.
>
> That would only apply at a per-module level, though, so it would require an indirection for the signature IDs.
But it would avoid a global registry. > > Another idea would be to set the signature ID field to 0 at the beginning > and call a C-API function to let the current runtime assign an ID > 0, > unique for the currently running application. Then every user would only > have to parse the signature once to adapt to the respective ID and could > otherwise branch based on it directly. > > For Cython, we could generate a static ID variable for each typed call that > we found in the sources. When encountering a C signature on a callable, > either a) the ID variable is still empty (initial case), then we parse the > signature to see if it matches the expected signature. If it does, we > assign the corresponding ID to the static ID variable and issue a direct > call. If b) the ID field is already set (normal case), we compare the > signature IDs directly and issue a C call it they match. If the IDs do not > match, we issue a normal Python call. > > >>> Right... if we do some work to synchronize the types for Cython modules >>> generated by the same version of Cython, we're left with 3-4 types for >>> Cython, right? Then a couple for numba and one for f2py; so on the order of >>> 10? >> >> No, I think each closure is its own type. > > And that even applies to fused functions, right? They'd have one closure > for each type combination. > Hm, there is only one type for the function (CyFunction), but there is a different type for the closure scope for each closure. The same goes for FusedFunction, there is only one type, and each instance contains a dict of specializations (mapping signatures to PyCFunctions). (But each module still has different function types of course). >>> An alternative is do something funny in the type object to get across the >>> offset-in-object information (abusing the docstring, or introduce our own >>> flag which means that the type object has an additional non-standard field >>> at the end). 
>> >> It's a hack, but the flag + non-standard field idea might just work... > > Plus, it wouldn't have to stay a non-standard field. If it's accepted into > CPython 3.4, we could safely use it in all existing versions of CPython. > > Stefan > ___ > cython-devel mailing list > cython-devel@python.org > http://mail.python.org/mailman/listinfo/cython-devel
Re: [Cython] CEP1000: Native dispatch through callables
On 13 April 2012 12:59, Dag Sverre Seljebotn wrote: > On 04/13/2012 01:38 PM, Stefan Behnel wrote: >> >> Robert Bradshaw, 13.04.2012 12:17: >>> >>> On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote: On 04/13/2012 01:38 AM, Robert Bradshaw wrote: > > Have you given any thought as to what happens if __call__ is > re-assigned for an object (or subclass of an object) supporting this > interface? Or is this out of scope? Out-of-scope, I'd say. Though you can always write an object that detects if you assign to __call__... >> >> >> +1 for out of scope. This is a pure C level feature. >> >> > Minor nit: I don't think should_dereference is worth branching on, if > one wants to save the allocation one can still use a variable-sized > type and point to oneself. Yes, that's an extra dereference, but the > memory is already likely close and it greatly simplifies the logic. > But I could be wrong here. Those minor nits are exactly what I seek; since Travis will have the first implementation in numba<->SciPy, I just want to make sure that what he does will work efficiently work Cython. >>> >>> >>> +1 >>> >>> I have to admit building/invoking these var-arg-sized __nativecall__ >>> records seems painful. Here's another suggestion: >>> >>> struct { >>> void* pointer; >>> size_t signature; // compressed binary representation, 95% coverage > > > Once you start passing around functions that take memory view slices as > arguments, that 95% estimate will be off I think. > It kind of depends on which arguments types and how many arguments you will allow, and whether or not collisions would be fine (which would imply ID comparison + strcmp()). >>> char* long_signature; // used if signature is not representable in >>> a size_t, as indicated by signature = 0 >>> } record; >>> >>> These char* could optionally be allocated at the end of the record* >>> for optimal locality. 
We could even dispense with the binary >>> signature, but having that option allows us to avoid strcmp for stuff >>> like d)d and ffi)f. >> >> >> Assuming we use literals and a const char* for the signature, the C >> compiler would cut down the number of signature strings automatically for >> us. And a pointer comparison is the same as a size_t comparison. > > > I'll go one further: Intern Python bytes objects. It's just a PyObject*, but > it's *required* (or just strongly encouraged) to have gone through > > sig = sys.modules['_nativecall']['interned_db'].setdefault(sig, sig) > > Obviously in a PEP you'd have a C-API function for such interning > (completely standalone utility). Performance of interning operation itself > doesn't matter... > > Unless CPython has interning features itself, like in Java? Was that present > back in the day and then ripped out? > > Requiring interning is somewhat less elegant in one way, but it makes a lot > of other stuff much simpler. > > That gives us > > struct { > void *pointer; > PyBytesObject *signature; > } record; > > and then you allocate a NULL-terminated arrays of these for all the > overloads. > Interesting. What I like about size_t it that it could define a deterministic ordering, which means specializations could be stored in a binary search tree in array form. Cython would precompute the size_t for the specialization it needs (and maybe account for promotions as well). >> >> That would only apply at a per-module level, though, so it would require >> an >> indirection for the signature IDs. But it would avoid a global registry. >> >> Another idea would be to set the signature ID field to 0 at the beginning >> and call a C-API function to let the current runtime assign an ID> 0, >> unique for the currently running application. Then every user would only >> have to parse the signature once to adapt to the respective ID and could >> otherwise branch based on it directly. 
>> >> For Cython, we could generate a static ID variable for each typed call >> that >> we found in the sources. When encountering a C signature on a callable, >> either a) the ID variable is still empty (initial case), then we parse the >> signature to see if it matches the expected signature. If it does, we >> assign the corresponding ID to the static ID variable and issue a direct >> call. If b) the ID field is already set (normal case), we compare the >> signature IDs directly and issue a C call it they match. If the IDs do not >> match, we issue a normal Python call. >> >> Right... if we do some work to synchronize the types for Cython modules generated by the same version of Cython, we're left with 3-4 types for Cython, right? Then a couple for numba and one for f2py; so on the order of 10? >>> >>> >>> No, I think each closure is its own type. >> >> >> And that even applies to fused functions, right? They'd have one closure >> for each type combination. >> >> An alternative is
do something funny in the type object to get across the offset-in-object information (abusing the docstring, or introduce our own flag which means that the type object has an additional non-standard field at the end).
Re: [Cython] CEP1000: Native dispatch through callables
Stefan Behnel, 13.04.2012 14:27: > Dag Sverre Seljebotn, 13.04.2012 13:59: >> Requiring interning is somewhat less elegant in one way, but it makes a lot >> of other stuff much simpler. >> >> That gives us >> >> struct { >> void *pointer; >> PyBytesObject *signature; >> } record; >> >> and then you allocate a NULL-terminated arrays of these for all the >> overloads. > > However, the problem is the setup. These references will have to be created > at init time and discarded during runtime termination. Not a problem for > Cython generated code, but some overhead for hand written code. > > Since the size of these structs is not a problem, I'd prefer keeping Python > objects out of the game and using an ssize_t ID instead, inferred from a > char* signature at module init time by calling a C-API function. That > avoids the need for any cleanup. Actually, we could even use interned char* values. Nothing keeps that C-API setup function from reassigning the "char* signature" field to the char* buffer of an internally allocated byte string. Except that we'd have to *require* users to use literals or otherwise statically allocated C strings in that field. Hmm, maybe not the best idea ever... Stefan
Re: [Cython] CEP1000: Native dispatch through callables
On 13 April 2012 13:48, Stefan Behnel wrote: > Stefan Behnel, 13.04.2012 14:27: >> Dag Sverre Seljebotn, 13.04.2012 13:59: >>> Requiring interning is somewhat less elegant in one way, but it makes a lot >>> of other stuff much simpler. >>> >>> That gives us >>> >>> struct { >>> void *pointer; >>> PyBytesObject *signature; >>> } record; >>> >>> and then you allocate a NULL-terminated arrays of these for all the >>> overloads. >> >> However, the problem is the setup. These references will have to be created >> at init time and discarded during runtime termination. Not a problem for >> Cython generated code, but some overhead for hand written code. >> >> Since the size of these structs is not a problem, I'd prefer keeping Python >> objects out of the game and using an ssize_t ID instead, inferred from a >> char* signature at module init time by calling a C-API function. That >> avoids the need for any cleanup. > > Actually, we could even use interned char* values. Nothing keeps that C-API > setup function from reassigning the "char* signature" field to the char* > buffer of an internally allocated byte string. Except that we'd have to > *require* users to use literals or otherwise statically allocated C strings > in that field. Hmm, maybe not the best idea ever... > > Stefan > ___ > cython-devel mailing list > cython-devel@python.org > http://mail.python.org/mailman/listinfo/cython-devel You could create a module shared by all versions and projects, which exposes a function 'get_signature', which given a char *signature returns the pointer that should be used in the ABI signature type information. You can then always compare by identity. ___ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel
Re: [Cython] CEP1000: Native dispatch through callables
On 04/13/2012 03:01 PM, mark florisson wrote: On 13 April 2012 13:48, Stefan Behnel wrote: Stefan Behnel, 13.04.2012 14:27: Dag Sverre Seljebotn, 13.04.2012 13:59: Requiring interning is somewhat less elegant in one way, but it makes a lot of other stuff much simpler. That gives us struct { void *pointer; PyBytesObject *signature; } record; and then you allocate a NULL-terminated arrays of these for all the overloads. However, the problem is the setup. These references will have to be created at init time and discarded during runtime termination. Not a problem for Cython generated code, but some overhead for hand written code. Since the size of these structs is not a problem, I'd prefer keeping Python objects out of the game and using an ssize_t ID instead, inferred from a char* signature at module init time by calling a C-API function. That avoids the need for any cleanup. Actually, we could even use interned char* values. Nothing keeps that C-API setup function from reassigning the "char* signature" field to the char* buffer of an internally allocated byte string. Except that we'd have to *require* users to use literals or otherwise statically allocated C strings in that field. Hmm, maybe not the best idea ever... Stefan ___ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel You could create a module shared by all versions and projects, which exposes a function 'get_signature', which given a char *signature returns the pointer that should be used in the ABI signature type information. You can then always compare by identity. I fail to see how this is different from what I proposed, with interning bytes objects (which I still prefer; although the binary-search features of direct comparison makes that attractive too). BTW, any proposal that requires an actual project/library that both Cython and NumPy depends on will fail in the real world. 
Dag
Re: [Cython] pyregr test suite
Stefan Behnel, 13.04.2012 07:11: > Robert Bradshaw, 12.04.2012 22:21: >> On Thu, Apr 12, 2012 at 11:21 AM, mark florisson wrote: >>> Could we run the pyregr test suite manually instead of automatically? >>> It takes a lot of resources to build, and a single simple push to the >>> cython-devel branch results in the build slots being hogged for hours, >>> making the continuous development a lot less 'continuous'. We could >>> just decide to run the pyregr suite every so often, or whenever we >>> make an addition or change that could actually affect Python code (if >>> one updates a test then there is no use in running pyregr for >>> instance). >> >> +1 to manual + periodic for these tests. Alternatively we could make >> them depend on each other, so at most one core is consumed. > > Ok, I'll set it up. They are now triggered by the (nightly) CPython builds and the four configurations run sequentially (there's an option for that), starting with the C tests. I would recommend configuring your own pyregr test jobs (if you have any) for manual runs by disabling all of their triggers. Stefan ___ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel
Re: [Cython] CEP1000: Native dispatch through callables
On 13 April 2012 14:27, Dag Sverre Seljebotn wrote: > On 04/13/2012 03:01 PM, mark florisson wrote: >> >> On 13 April 2012 13:48, Stefan Behnel wrote: >>> >>> Stefan Behnel, 13.04.2012 14:27: Dag Sverre Seljebotn, 13.04.2012 13:59: > > Requiring interning is somewhat less elegant in one way, but it makes a > lot > of other stuff much simpler. > > That gives us > > struct { > void *pointer; > PyBytesObject *signature; > } record; > > and then you allocate a NULL-terminated arrays of these for all the > overloads. However, the problem is the setup. These references will have to be created at init time and discarded during runtime termination. Not a problem for Cython generated code, but some overhead for hand written code. Since the size of these structs is not a problem, I'd prefer keeping Python objects out of the game and using an ssize_t ID instead, inferred from a char* signature at module init time by calling a C-API function. That avoids the need for any cleanup. >>> >>> >>> Actually, we could even use interned char* values. Nothing keeps that >>> C-API >>> setup function from reassigning the "char* signature" field to the char* >>> buffer of an internally allocated byte string. Except that we'd have to >>> *require* users to use literals or otherwise statically allocated C >>> strings >>> in that field. Hmm, maybe not the best idea ever... >>> >>> Stefan >>> ___ >>> cython-devel mailing list >>> cython-devel@python.org >>> http://mail.python.org/mailman/listinfo/cython-devel >> >> >> You could create a module shared by all versions and projects, which >> exposes a function 'get_signature', which given a char *signature >> returns the pointer that should be used in the ABI signature type >> information. You can then always compare by identity. > > > I fail to see how this is different from what I proposed, with interning > bytes objects (which I still prefer; although the binary-search features of > direct comparison makes that attractive too). 
It's not really different, more a response to Stefan's comment. > BTW, any proposal that requires an actual project/library that both Cython > and NumPy depends on will fail in the real world. That's fine as long as they use the same way to expose ABI information. As a courtesy though, we could do it anyway, which makes it easier for those respective projects to understand what's involved, how to implement it, and they can then decide whether they want to ship that project as part of their own project. > Dag > > ___ > cython-devel mailing list > cython-devel@python.org > http://mail.python.org/mailman/listinfo/cython-devel ___ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel
Re: [Cython] CEP1000: Native dispatch through callables
On Fri, Apr 13, 2012 at 4:59 AM, Dag Sverre Seljebotn wrote: > On 04/13/2012 01:38 PM, Stefan Behnel wrote: >> >> Robert Bradshaw, 13.04.2012 12:17: >>> >>> On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote: On 04/13/2012 01:38 AM, Robert Bradshaw wrote: > > Have you given any thought as to what happens if __call__ is > re-assigned for an object (or subclass of an object) supporting this > interface? Or is this out of scope? Out-of-scope, I'd say. Though you can always write an object that detects if you assign to __call__... >> >> >> +1 for out of scope. This is a pure C level feature. >> >> > Minor nit: I don't think should_dereference is worth branching on, if > one wants to save the allocation one can still use a variable-sized > type and point to oneself. Yes, that's an extra dereference, but the > memory is already likely close and it greatly simplifies the logic. > But I could be wrong here. Those minor nits are exactly what I seek; since Travis will have the first implementation in numba<->SciPy, I just want to make sure that what he does will work efficiently work Cython. >>> >>> >>> +1 >>> >>> I have to admit building/invoking these var-arg-sized __nativecall__ >>> records seems painful. Here's another suggestion: >>> >>> struct { >>> void* pointer; >>> size_t signature; // compressed binary representation, 95% coverage > > Once you start passing around functions that take memory view slices as > arguments, that 95% estimate will be off I think. We have (on the high-performance systems we care about) 64-bits here. If we limit ourselves to a 6-bit alphabet, that gives a trivial encoding for up to 10 chars. We could be more clever here (Huffman coding) but that might be overkill. More importantly though, the "complicated" signatures are likely to be so cheap that the strcmp overhead matters. 
>>> char* long_signature; // used if signature is not representable in >>> a size_t, as indicated by signature = 0 >>> } record; >>> >>> These char* could optionally be allocated at the end of the record* >>> for optimal locality. We could even dispense with the binary >>> signature, but having that option allows us to avoid strcmp for stuff >>> like d)d and ffi)f. >> >> >> Assuming we use literals and a const char* for the signature, the C >> compiler would cut down the number of signature strings automatically for >> us. And a pointer comparison is the same as a size_t comparison. > > > I'll go one further: Intern Python bytes objects. It's just a PyObject*, but > it's *required* (or just strongly encouraged) to have gone through > > sig = sys.modules['_nativecall']['interned_db'].setdefault(sig, sig) > > Obviously in a PEP you'd have a C-API function for such interning > (completely standalone utility). Performance of interning operation itself > doesn't matter... > > Unless CPython has interning features itself, like in Java? Was that present > back in the day and then ripped out? > > Requiring interning is somewhat less elegant in one way, but it makes a lot > of other stuff much simpler. > > That gives us > > struct { > void *pointer; > PyBytesObject *signature; > } record; > > and then you allocate a NULL-terminated arrays of these for all the > overloads. Global interning is a nice idea. The one drawback I see is that it becomes much more expensive for dynamically calculated signatures. >> >> That would only apply at a per-module level, though, so it would require >> an >> indirection for the signature IDs. But it would avoid a global registry. >> >> Another idea would be to set the signature ID field to 0 at the beginning >> and call a C-API function to let the current runtime assign an ID> 0, >> unique for the currently running application. 
Then every user would only >> have to parse the signature once to adapt to the respective ID and could >> otherwise branch based on it directly. >> >> For Cython, we could generate a static ID variable for each typed call >> that >> we found in the sources. When encountering a C signature on a callable, >> either a) the ID variable is still empty (initial case), then we parse the >> signature to see if it matches the expected signature. If it does, we >> assign the corresponding ID to the static ID variable and issue a direct >> call. If b) the ID field is already set (normal case), we compare the >> signature IDs directly and issue a C call if they match. If the IDs do not >> match, we issue a normal Python call. If I understand correctly, you're proposing struct { char* sig; long id; } sig_t; Where comparison would (sometimes?) compute id from sig by augmenting a global counter and dict? Might be expensive to bootstrap, but eventually all relevant ids would be filled in and it would be quick. Interesting. I wonder what the performance penalty would be over assuming id is statically computed lots of the time, and using that to compare against fixed values. And there's memory locality issues as well.
Re: [Cython] CEP1000: Native dispatch through callables
On Fri, Apr 13, 2012 at 5:48 AM, mark florisson wrote: > On 13 April 2012 12:59, Dag Sverre Seljebotn > wrote: >> On 04/13/2012 01:38 PM, Stefan Behnel wrote: >>> >>> Robert Bradshaw, 13.04.2012 12:17: On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote: > > On 04/13/2012 01:38 AM, Robert Bradshaw wrote: >> >> Have you given any thought as to what happens if __call__ is >> re-assigned for an object (or subclass of an object) supporting this >> interface? Or is this out of scope? > > > Out-of-scope, I'd say. Though you can always write an object that > detects if > you assign to __call__... >>> >>> >>> +1 for out of scope. This is a pure C level feature. >>> >>> >> Minor nit: I don't think should_dereference is worth branching on, if >> one wants to save the allocation one can still use a variable-sized >> type and point to oneself. Yes, that's an extra dereference, but the >> memory is already likely close and it greatly simplifies the logic. >> But I could be wrong here. > > > > Those minor nits are exactly what I seek; since Travis will have the > first > implementation in numba<->SciPy, I just want to make sure that what he > does > will work efficiently work Cython. +1 I have to admit building/invoking these var-arg-sized __nativecall__ records seems painful. Here's another suggestion: struct { void* pointer; size_t signature; // compressed binary representation, 95% coverage >> >> >> Once you start passing around functions that take memory view slices as >> arguments, that 95% estimate will be off I think. >> > > It kind of depends on which arguments types and how many arguments you > will allow, and whether or not collisions would be fine (which would > imply ID comparison + strcmp()). Interesting idea, though this has the drawback of doubling (at least) the overhead of the simple (important) case as well as memory requirements/locality issues. 
char* long_signature; // used if signature is not representable in a size_t, as indicated by signature = 0 } record; These char* could optionally be allocated at the end of the record* for optimal locality. We could even dispense with the binary signature, but having that option allows us to avoid strcmp for stuff like d)d and ffi)f. >>> >>> >>> Assuming we use literals and a const char* for the signature, the C >>> compiler would cut down the number of signature strings automatically for >>> us. And a pointer comparison is the same as a size_t comparison. >> >> >> I'll go one further: Intern Python bytes objects. It's just a PyObject*, but >> it's *required* (or just strongly encouraged) to have gone through >> >> sig = sys.modules['_nativecall']['interned_db'].setdefault(sig, sig) >> >> Obviously in a PEP you'd have a C-API function for such interning >> (completely standalone utility). Performance of interning operation itself >> doesn't matter... >> >> Unless CPython has interning features itself, like in Java? Was that present >> back in the day and then ripped out? >> >> Requiring interning is somewhat less elegant in one way, but it makes a lot >> of other stuff much simpler. >> >> That gives us >> >> struct { >> void *pointer; >> PyBytesObject *signature; >> } record; >> >> and then you allocate a NULL-terminated arrays of these for all the >> overloads. >> > > Interesting. What I like about size_t it that it could define a > deterministic ordering, which means specializations could be stored in > a binary search tree in array form. I think the number of specializations would have to be quite large (>10, maybe 100) before a binary search wins out over a simple scan, but if we stored a count rather than did a null-terminated array teh lookup function could take this into account. (The header will already have plenty of room if we're storing a version number and want the records to be properly aligned.) 
Requiring them to be sorted would also allow us to abort on average half way through a scan. Of course prioritizing the "likely" signatures first may be more of a win. > Cython would precompute the size_t > for the specialization it needs (and maybe account for promotions as > well). Exactly. >>> That would only apply at a per-module level, though, so it would require >>> an >>> indirection for the signature IDs. But it would avoid a global registry. >>> >>> Another idea would be to set the signature ID field to 0 at the beginning >>> and call a C-API function to let the current runtime assign an ID> 0, >>> unique for the currently running application. Then every user would only >>> have to parse the signature once to adapt to the respective ID and could >>> otherwise branch based on it directly. >>> >>> For Cython, we could generate a static ID variable for each typed call >>> that >>> we found in the sources. When encountering a C signature on a callable, ...
Re: [Cython] CEP1000: Native dispatch through callables
On Fri, Apr 13, 2012 at 10:26 AM, Robert Bradshaw wrote: > On Fri, Apr 13, 2012 at 4:59 AM, Dag Sverre Seljebotn > wrote: >> On 04/13/2012 01:38 PM, Stefan Behnel wrote: >>> That would only apply at a per-module level, though, so it would require >>> an >>> indirection for the signature IDs. But it would avoid a global registry. >>> >>> Another idea would be to set the signature ID field to 0 at the beginning >>> and call a C-API function to let the current runtime assign an ID> 0, >>> unique for the currently running application. Then every user would only >>> have to parse the signature once to adapt to the respective ID and could >>> otherwise branch based on it directly. >>> >>> For Cython, we could generate a static ID variable for each typed call >>> that >>> we found in the sources. When encountering a C signature on a callable, >>> either a) the ID variable is still empty (initial case), then we parse the >>> signature to see if it matches the expected signature. If it does, we >>> assign the corresponding ID to the static ID variable and issue a direct >>> call. If b) the ID field is already set (normal case), we compare the >>> signature IDs directly and issue a C call it they match. If the IDs do not >>> match, we issue a normal Python call. > > If I understand correctly, you're proposing > > struct { > char* sig; > long id; > } sig_t; > > Where comparison would (sometimes?) compute id from sig by augmenting > a global counter and dict? Might be expensive to bootstrap, but > eventually all relevant ids would be filled in and it would be quick. > Interesting. I wonder what the performance penalty would be over > assuming id is statically computed lots of the time, and using that to > compare against fixed values. And there's memory locality issues as > well. 
To clarify, I'd really like to have the following as fast as possible:

    if (callable.sig.id == X) {
        // yep, that's what I thought
    } else {
        // generic call
    }

Alternatively, one can imagine wanting to do:

    switch (callable.sig.id) {
    case X:
        // I can do this
    case Y:
        // this is common and fast as well
    ...
    default:
        // generic call
    }

There is some question about how promotion should work (e.g. should this flexibility reside in the caller or the callee (or both, though that could result in a quadratic number of comparisons)?) - Robert
Re: [Cython] CEP1000: Native dispatch through callables
Stefan Behnel, 13.04.2012 07:24: > Dag Sverre Seljebotn, 13.04.2012 00:34: >> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote: >> http://wiki.cython.org/enhancements/cep1000 > > I'm all for doing something in this direction and have been hinting at it > on the PyPy mailing list for a while, without reaction so far. I'll trigger > them again, with a pointer to this discussion and the CEP. PyPy should be > totally interested in a generic way to do fast calls into wrapped C code in > general and Cython implemented functions specifically. Their JIT would then > look at the function at runtime and unwrap it. I just learned that the support in PyPy would be rather straightforward. It already supports calling native code with a known signature through their "rlib/libffi.py" module, so all that remains to be done on their side is mapping the encoded signature to their own signature configuration. Stefan
Re: [Cython] CEP1000: Native dispatch through callables
On Fri, Apr 13, 2012 at 12:15 PM, Stefan Behnel wrote: > Stefan Behnel, 13.04.2012 07:24: >> Dag Sverre Seljebotn, 13.04.2012 00:34: >>> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote: >>> http://wiki.cython.org/enhancements/cep1000 >> >> I'm all for doing something in this direction and have been hinting at it >> on the PyPy mailing list for a while, without reaction so far. I'll trigger >> them again, with a pointer to this discussion and the CEP. PyPy should be >> totally interested in a generic way to do fast calls into wrapped C code in >> general and Cython implemented functions specifically. Their JIT would then >> look at the function at runtime and unwrap it. > > I just learned that the support in PyPy would be rather straight forward. > It already supports calling native code with a known signature through > their "rlib/libffi.py" module, Cool. > so all that remains to be done on their side > is mapping the encoded signature to their own signature configuration. Or looking into borrowing theirs? (We might want more extensibility, e.g. declaring buffer types and nogil/exception data. I assume ctypes has a signature declaration format as well, right?) - Robert
Re: [Cython] CEP1000: Native dispatch through callables
Robert Bradshaw, 13.04.2012 21:26: > On Fri, Apr 13, 2012 at 12:15 PM, Stefan Behnel wrote: >> Stefan Behnel, 13.04.2012 07:24: >>> Dag Sverre Seljebotn, 13.04.2012 00:34: On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote: http://wiki.cython.org/enhancements/cep1000 >>> >>> I'm all for doing something in this direction and have been hinting at it >>> on the PyPy mailing list for a while, without reaction so far. I'll trigger >>> them again, with a pointer to this discussion and the CEP. PyPy should be >>> totally interested in a generic way to do fast calls into wrapped C code in >>> general and Cython implemented functions specifically. Their JIT would then >>> look at the function at runtime and unwrap it. >> >> I just learned that the support in PyPy would be rather straight forward. >> It already supports calling native code with a known signature through >> their "rlib/libffi.py" module, > > Cool. > >> so all that remains to be done on their side >> is mapping the encoded signature to their own signature configuration. > > Or looking into borrowing theirs? (We might want more extensibility, > e.g. declaring buffer types and nogil/exception data. I assume ctypes > has a signature declaration format as well, right?) PyPy's ctypes implementation is based on libffi. However, I think neither of the two has a declaration format (e.g. string based) other than the object based declaration notation. You basically pass them a sequence of type objects to declare the signature. That's not really easy to map to the C level - at least not efficiently... Stefan
Re: [Cython] CEP1000: Native dispatch through callables
Robert Bradshaw, 13.04.2012 20:21: > On Fri, Apr 13, 2012 at 10:26 AM, Robert Bradshaw wrote: >> On Fri, Apr 13, 2012 at 4:59 AM, Dag Sverre Seljebotn wrote: >>> On 04/13/2012 01:38 PM, Stefan Behnel wrote: That would only apply at a per-module level, though, so it would require an indirection for the signature IDs. But it would avoid a global registry. Another idea would be to set the signature ID field to 0 at the beginning and call a C-API function to let the current runtime assign an ID> 0, unique for the currently running application. Then every user would only have to parse the signature once to adapt to the respective ID and could otherwise branch based on it directly. For Cython, we could generate a static ID variable for each typed call that we found in the sources. When encountering a C signature on a callable, either a) the ID variable is still empty (initial case), then we parse the signature to see if it matches the expected signature. If it does, we assign the corresponding ID to the static ID variable and issue a direct call. If b) the ID field is already set (normal case), we compare the signature IDs directly and issue a C call it they match. If the IDs do not match, we issue a normal Python call. >> >> If I understand correctly, you're proposing >> >> struct { >> char* sig; >> long id; >> } sig_t; >> >> Where comparison would (sometimes?) compute id from sig by augmenting >> a global counter and dict? Might be expensive to bootstrap, but >> eventually all relevant ids would be filled in and it would be quick. Yes. If a function is only called once, the overhead won't matter. And starting from the second call, it would either be fast if the function signature matches or slow anyway if it doesn't match. >> Interesting. I wonder what the performance penalty would be over >> assuming id is statically computed lots of the time, and using that to >> compare against fixed values. And there's memory locality issues as >> well. 
> > To clarify, I'd really like to have the following as fast as possible: > > if (callable.sig.id == X) { >// yep, that's what I thought > } else { >// generic call > } > > Alternatively, one can imagine wanting to do: > > switch (callable.sig.id) { > case X: > // I can do this > case Y: > // this is common and fast as well > ... > default: > // generic call > } Yes, that's the idea. > There is some question about how promotion should work (e.g. should > this flexibility reside in the caller or the callee (or both, though > that could result in a quadratic number of comparisons)?) Callees could expose multiple signatures (which would result in a direct call for each, without further comparisons), then the caller would have to choose between those. However, if none matches exactly, the caller might want to promote its arguments and try more signatures. In any case, it's the caller that does the work, never the callee. We could generate code like this: /* cdef int x = ... * cdef long y = ... * cdef int z # interesting: what if z is not typed? * z = func(x, y) */ if (func.sig.id == id("[int,long] -> int")) { z = ((cast)func.cfunc) (x,y); } else if (sizeof(long) > sizeof(int) && (func.sig.id == id("[long,long] -> int"))) { z = ((cast)func.cfunc) ((long)x, y); } etc. ... else { /* pack and call as Python function */ } Meaning, the C compiler could reduce the amount of optimistic call code at compile time. Stefan ___ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel
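Stefan's two-phase scheme (parse the signature once, then branch on an integer ID forever after) can be modeled in a few lines of Python. This is only a sketch of the idea, not the proposed C ABI; the `Callable` class, the registry, and the `call` helper are invented for illustration.

```python
# Model of the lazy signature-ID scheme: IDs are handed out by a
# process-wide registry on first use; afterwards every dispatch is a
# single integer comparison, like `if (callable.sig.id == X)` in C.

_registry = {}   # signature string -> small unique id
_next_id = 1     # 0 is reserved for "not yet assigned"

def get_signature_id(sig):
    global _next_id
    sid = _registry.get(sig)
    if sid is None:
        sid = _registry[sig] = _next_id
        _next_id += 1
    return sid

class Callable:
    def __init__(self, sig, cfunc):
        self.sig = sig
        self.sig_id = 0        # assigned lazily, as in the proposal
        self.cfunc = cfunc     # stands in for the native function pointer

def call(c, expected_id, args, generic_call):
    if c.sig_id == 0:                    # case (a): first encounter, parse/register
        c.sig_id = get_signature_id(c.sig)
    if c.sig_id == expected_id:          # case (b): fast path, direct "C call"
        return c.cfunc(*args)
    return generic_call(args)            # mismatch: normal Python call

def generic(args):
    return ("python-call", args)

f = Callable("[int,long] -> int", lambda x, y: x + y)
expected = get_signature_id("[int,long] -> int")
assert call(f, expected, (3, 4), generic) == 7                       # direct path
other = get_signature_id("[double] -> double")
assert call(f, other, (3, 4), generic) == ("python-call", (3, 4))    # fallback
```

The first call pays the registry lookup; from the second call on, matching is one word-sized comparison, which is the property the thread is after.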
Re: [Cython] CEP1000: Native dispatch through callables
Ah, I didn't think about 6-bit or huffman. Certainly helps. I'm almost +1 on your proposal now, but a couple more ideas: 1) Let the key (the size_t) spill over to the next specialization entry if it is too large; and prepend that key with a continuation code (two size_ts could together say "iii)-d\0\0" on 32-bit systems with 8-bit encoding, using - as continuation). The key-based caller will expect a continuation if it knows about the specialization, and the prepended char will prevent spurious matches against the overspilled slot. We could even use the pointers for part of the continuation... 2) Separate the char* format strings from the keys, i.e. this memory layout: Version,nslots,nspecs,funcptr,key,funcptr,key,...,sigcharptr,sigcharptr... Where nslots is larger than nspecs if there are continuations. OK, this is getting close to my original proposal, but the difference is the continuation char, so that if you expect a short signature, you can safely scan every slot with no branching and no null-checking necessary. Dag -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. Robert Bradshaw wrote: On Fri, Apr 13, 2012 at 4:59 AM, Dag Sverre Seljebotn wrote: > On 04/13/2012 01:38 PM, Stefan Behnel wrote: >> >> Robert Bradshaw, 13.04.2012 12:17: >>> >>> On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote: On 04/13/2012 01:38 AM, Robert Bradshaw wrote: > > Have you given any thought as to what happens if __call__ is > re-assigned for an object (or subclass of an object) supporting this > interface? Or is this out of scope? Out-of-scope, I'd say. Though you can always write an object that detects if you assign to __call__... >> >> >> +1 for out of scope. This is a pure C level feature. >> >> > Minor nit: I don't think should_dereference is worth branching on, if > one wants to save the allocation one can still use a variable-sized > type and point to oneself. 
Yes, that's an extra dereference, but the > memory is already likely close and it greatly simplifies the logic. > But I could be wrong here. Those minor nits are exactly what I seek; since Travis will have the first implementation in numba<->SciPy, I just want to make sure that what he does will work efficiently with Cython. >>> >>> >>> +1 >>> >>> I have to admit building/invoking these var-arg-sized __nativecall__ >>> records seems painful. Here's another suggestion: >>> >>> struct { >>> void* pointer; >>> size_t signature; // compressed binary representation, 95% coverage > > Once you start passing around functions that take memory view slices as > arguments, that 95% estimate will be off I think. We have (on the high-performance systems we care about) 64-bits here. If we limit ourselves to a 6-bit alphabet, that gives a trivial encoding for up to 10 chars. We could be more clever here (Huffman coding) but that might be overkill. More importantly though, the "complicated" signatures are unlikely to be so cheap that the strcmp overhead matters. >>> char* long_signature; // used if signature is not representable in >>> a size_t, as indicated by signature = 0 >>> } record; >>> >>> These char* could optionally be allocated at the end of the record* >>> for optimal locality. We could even dispense with the binary >>> signature, but having that option allows us to avoid strcmp for stuff >>> like d)d and ffi)f. >> >> >> Assuming we use literals and a const char* for the signature, the C >> compiler would cut down the number of signature strings automatically for >> us. And a pointer comparison is the same as a size_t comparison. > > > I'll go one further: Intern Python bytes objects. It's just a PyObject*, but > it's *required* (or just strongly encouraged) to have gone through > > sig = sys.modules['_nativecall']['interned_db'].setdefault(sig, sig) > > Obviously in a PEP you'd have a C-API function for such interning > (completely standalone utility). 
Performance of the interning operation itself > doesn't matter... > > Unless CPython has interning features itself, like in Java? Was that present > back in the day and then ripped out? > > Requiring interning is somewhat less elegant in one way, but it makes a lot > of other stuff much simpler. > > That gives us > > struct { >void *pointer; >PyBytesObject *signature; > } record; > > and then you allocate a NULL-terminated array of these for all the > overloads. Global interning is a nice idea. The one drawback I see is that it becomes much more expensive for dynamically calculated signatures. >> >> That would only apply at a per-module level, though, so it would require >> an >> indirection for the signature IDs. But it would avoid a global registry. >> >> Another idea would be to set the signature ID field to 0 at the beginning >> and call a C-API function to let the current runtime assign an ID > 0, >> unique for the currently running application [...]
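The interning proposal quoted above is easy to model: once every signature object passes through one shared dict, equal signatures collapse to a single object, and the match test becomes an identity (pointer) comparison. A minimal sketch, with `interned_db` standing in for the proposed `_nativecall` registry (a hypothetical name from the email, not an existing module):

```python
# One shared store for the whole process; at the C level this would be
# the "completely standalone utility" behind a C-API function.
interned_db = {}

def intern_sig(sig):
    """Funnel a signature bytes object through the shared store."""
    return interned_db.setdefault(sig, sig)

# Two modules independently construct equal signature strings...
sig_a = intern_sig(b"d)d")
part = b"d)"                       # built at runtime so it's a distinct object
sig_b = intern_sig(part + b"d")

# ...but after interning they are literally the same object, so the
# signature check is a single-word pointer comparison (`is`, not `==`).
assert sig_a is sig_b
assert intern_sig(b"ffi)f") is not sig_a
```

This is what makes "It's just a PyObject*" sufficient: the expensive comparison happens once, at interning time, never at call time.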
Re: [Cython] CEP1000: Native dispatch through callables
On Fri, Apr 13, 2012 at 12:52 PM, Stefan Behnel wrote: > Robert Bradshaw, 13.04.2012 20:21: >> On Fri, Apr 13, 2012 at 10:26 AM, Robert Bradshaw wrote: >>> On Fri, Apr 13, 2012 at 4:59 AM, Dag Sverre Seljebotn wrote: On 04/13/2012 01:38 PM, Stefan Behnel wrote: > That would only apply at a per-module level, though, so it would > require an indirection for the signature IDs. But it would avoid a > global registry. > > Another idea would be to set the signature ID field to 0 at the beginning > and call a C-API function to let the current runtime assign an ID> 0, > unique for the currently running application. Then every user would only > have to parse the signature once to adapt to the respective ID and could > otherwise branch based on it directly. > > For Cython, we could generate a static ID variable for each typed call > that > we found in the sources. When encountering a C signature on a callable, > either a) the ID variable is still empty (initial case), then we parse the > signature to see if it matches the expected signature. If it does, we > assign the corresponding ID to the static ID variable and issue a direct > call. If b) the ID field is already set (normal case), we compare the > signature IDs directly and issue a C call it they match. If the IDs do not > match, we issue a normal Python call. >>> >>> If I understand correctly, you're proposing >>> >>> struct { >>> char* sig; >>> long id; >>> } sig_t; >>> >>> Where comparison would (sometimes?) compute id from sig by augmenting >>> a global counter and dict? Might be expensive to bootstrap, but >>> eventually all relevant ids would be filled in and it would be quick. > > Yes. If a function is only called once, the overhead won't matter. And > starting from the second call, it would either be fast if the function > signature matches or slow anyway if it doesn't match. There's still data locality issues, including the cached id for the caller as well as the callee. >>> Interesting. 
I wonder what the performance penalty would be over >>> assuming id is statically computed lots of the time, and using that to >>> compare against fixed values. And there's memory locality issues as >>> well. >> >> To clarify, I'd really like to have the following as fast as possible: >> >> if (callable.sig.id == X) { >> // yep, that's what I thought >> } else { >> // generic call >> } >> >> Alternatively, one can imagine wanting to do: >> >> switch (callable.sig.id) { >> case X: >> // I can do this >> case Y: >> // this is common and fast as well >> ... >> default: >> // generic call >> } > > Yes, that's the idea. > > >> There is some question about how promotion should work (e.g. should >> this flexibility reside in the caller or the callee (or both, though >> that could result in a quadratic number of comparisons)?) > > Callees could expose multiple signatures (which would result in a direct > call for each, without further comparisons), then the caller would have to > choose between those. However, if none matches exactly, the caller might > want to promote its arguments and try more signatures. In any case, it's > the caller that does the work, never the callee. > > We could generate code like this: > > /* cdef int x = ... > * cdef long y = ... > * cdef int z # interesting: what if z is not typed? > * z = func(x, y) > */ > > if (func.sig.id == id("[int,long] -> int")) { > z = ((cast)func.cfunc) (x,y); > } else if (sizeof(long) > sizeof(int) && > (func.sig.id == id("[long,long] -> int"))) { > z = ((cast)func.cfunc) ((long)x, y); > } etc. ... else { > /* pack and call as Python function */ > } > > Meaning, the C compiler could reduce the amount of optimistic call code at > compile time. Interesting idea. Alternatively, I wonder if the signature could reflect exactly-sized types rather than int/long/etc. Perhaps that would make the code more complicated on both ends... I'm assuming your id(...) is computed at compile time in this example, right? 
Otherwise it would get a bit messier. - Robert
Re: [Cython] CEP1000: Native dispatch through callables
On Fri, Apr 13, 2012 at 1:27 PM, Dag Sverre Seljebotn wrote: > Ah, I didn't think about 6-bit or huffman. Certainly helps. Yeah, we don't want to complicate the ABI too much, but I think something like 8 4-bit common chars and 32 6-bit other chars (or 128 8-bit other chars) wouldn't be outrageous. The fact that we only have to encode into a single word makes the algorithm very simple (though the majority of the time we'd spit out pre-encoded literals). We have a version number to play with this as well. > I'm almost +1 on your proposal now, but a couple of more ideas: > > 1) Let the key (the size_t) spill over to the next specialization entry if > it is too large; and prepend that key with a continuation code (two size-ts > could together say "iii)-d\0\0" on 32 bit systems with 8bit encoding, using > - as continuation). The key-based caller will expect a continuation if it > knows about the specialization, and the prepended char will prevent spurios > matches against the overspilled slot. > > We could even use the pointers for part of the continuation... > > 2) Separate the char* format strings from the keys, ie this memory layout: > > Version,nslots,nspecs,funcptr,key,funcptr,key,...,sigcharptr,sigcharptr... > > Where nslots is larger than nspecs if there are continuations. > > OK, this is getting close to my original proposal, but the difference is the > contiunation char, so that if you expect a short signature, you can safely > scan every slot and branching and no null-checking necesarry. I don't think we need nslots (though it might be interesting). My thought is that once you start futzing with variable-length keys, you might as well just compare char*s. If one is concerned about memory, one could force the sigcharptr to be aligned, and then the "keys" could be either sigcharptr or key depending on whether the least significant bit was set. 
One could easily scan for/switch on a key and scanning for a char* would be almost as easy (just don't dereference if the lsb is set). I don't see us being memory constrained, so (version,nspecs,futureuse),(key,sigcharptr,funcptr)*,optionalsigchardata* seems fine to me even if only one of key/sigchrptr is ever used per spec. Null-terminating the specs would work fine as well (one less thing to keep track of during iteration). - Robert ___ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel
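Robert's encoding idea (a small alphabet packed into a single 64-bit word, with a char* fallback for signatures that don't fit) can be sketched as follows. The alphabet, the 6-bit code width, and the 10-character limit are illustrative assumptions taken from the discussion, not an agreed format:

```python
# Pack a short signature into one 64-bit key: 10 chars x 6 bits = 60
# bits, leaving room in the word. Longer signatures return None, i.e.
# the caller must fall back to comparing the char* signature string.

ALPHABET = "dfilhbcsuv)OIZq0123456789_"   # made-up 6-bit code space

def encode_key(sig):
    """Return an int key for sig, or None if it doesn't fit in one word."""
    if len(sig) > 10:
        return None
    key = 0
    for ch in reversed(sig):
        # +1 so that code 0 can act as the terminator, like '\0'.
        key = (key << 6) | (ALPHABET.index(ch) + 1)
    return key

# Call sites compare a precomputed constant against the callee's key:
assert encode_key("d)d") == encode_key("d)d")
assert encode_key("d)d") != encode_key("ff)f")
assert encode_key("d" * 12 + ")d") is None   # too long: use the char* path
```

A real implementation would emit these keys as compile-time literals ("pre-encoded literals" in Robert's words); the function above is what a runtime encoder for dynamically built signatures would do.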
Re: [Cython] CEP1000: Native dispatch through callables
On Fri, Apr 13, 2012 at 9:27 PM, Dag Sverre Seljebotn wrote: > Ah, I didn't think about 6-bit or huffman. Certainly helps. > > I'm almost +1 on your proposal now, but a couple of more ideas: > > 1) Let the key (the size_t) spill over to the next specialization entry if > it is too large; and prepend that key with a continuation code (two size-ts > could together say "iii)-d\0\0" on 32 bit systems with 8bit encoding, using > - as continuation). The key-based caller will expect a continuation if it > knows about the specialization, and the prepended char will prevent spurios > matches against the overspilled slot. > > We could even use the pointers for part of the continuation... I am really lost here. Why is any of this complicated encoding stuff better than interning? Interning takes one line of code, is incredibly cheap (one dict lookup per call site and function definition), and it lets you check any possible signature (even complicated ones involving memoryviews) by doing a single-word comparison. And best of all, you don't have to think hard to make sure you got the encoding right. ;-) On a 32-bit system, pointers are smaller than a size_t, but more expressive! You can still do binary search if you want, etc. Is the problem just that interning requires a runtime calculation? Because I feel like C users (like numpy) will want to compute these compressed codes at module-init anyway, and those of us with a fancy compiler capable of computing them ahead of time (like Cython) can instruct that fancy compiler to compute them at module-init time just as easily? -- Nathaniel ___ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel
Re: [Cython] CEP1000: Native dispatch through callables
On Fri, Apr 13, 2012 at 2:24 PM, Nathaniel Smith wrote: > On Fri, Apr 13, 2012 at 9:27 PM, Dag Sverre Seljebotn > wrote: >> Ah, I didn't think about 6-bit or huffman. Certainly helps. >> >> I'm almost +1 on your proposal now, but a couple of more ideas: >> >> 1) Let the key (the size_t) spill over to the next specialization entry if >> it is too large; and prepend that key with a continuation code (two size-ts >> could together say "iii)-d\0\0" on 32 bit systems with 8bit encoding, using >> - as continuation). The key-based caller will expect a continuation if it >> knows about the specialization, and the prepended char will prevent spurios >> matches against the overspilled slot. >> >> We could even use the pointers for part of the continuation... > > I am really lost here. Why is any of this complicated encoding stuff > better than interning? Interning takes one line of code, is incredibly > cheap (one dict lookup per call site and function definition), and it > lets you check any possible signature (even complicated ones involving > memoryviews) by doing a single-word comparison. And best of all, you > don't have to think hard to make sure you got the encoding right. ;-) > > On a 32-bit system, pointers are smaller than a size_t, but more > expressive! You can still do binary search if you want, etc. Is the > problem just that interning requires a runtime calculation? Because I > feel like C users (like numpy) will want to compute these compressed > codes at module-init anyway, and those of us with a fancy compiler > capable of computing them ahead of time (like Cython) can instruct > that fancy compiler to compute them at module-init time just as > easily? Good question. The primary disadvantage of interning that I see is memory locality. I suppose if all the C-level caches of interned values were co-located, this may not be as big of an issue. 
Not being able to compare against compile-time constants may thwart some optimization opportunities, but that's less clear. It also requires coordination on a common repository, but I suppose one would just stick a set in some standard module (or leverage Python's interning). - Robert
Re: [Cython] CEP1000: Native dispatch through callables
Robert Bradshaw wrote: >On Fri, Apr 13, 2012 at 1:27 PM, Dag Sverre Seljebotn > wrote: >> Ah, I didn't think about 6-bit or huffman. Certainly helps. > >Yeah, we don't want to complicate the ABI too much, but I think >something like 8 4-bit common chars and 32 6-bit other chars (or 128 >8-bit other chars) wouldn't be outrageous. The fact that we only have >to encode into a single word makes the algorithm very simple (though >the majority of the time we'd spit out pre-encoded literals). We have >a version number to play with this as well. > >> I'm almost +1 on your proposal now, but a couple of more ideas: >> >> 1) Let the key (the size_t) spill over to the next specialization >entry if >> it is too large; and prepend that key with a continuation code (two >size-ts >> could together say "iii)-d\0\0" on 32 bit systems with 8bit encoding, >using >> - as continuation). The key-based caller will expect a continuation >if it >> knows about the specialization, and the prepended char will prevent >spurios >> matches against the overspilled slot. >> >> We could even use the pointers for part of the continuation... >> >> 2) Separate the char* format strings from the keys, ie this memory >layout: >> >> >Version,nslots,nspecs,funcptr,key,funcptr,key,...,sigcharptr,sigcharptr... >> >> Where nslots is larger than nspecs if there are continuations. >> >> OK, this is getting close to my original proposal, but the difference >is the >> contiunation char, so that if you expect a short signature, you can >safely >> scan every slot and branching and no null-checking necesarry. > >I don't think we need nslots (though it might be interesting). My >thought is that once you start futzing with variable-length keys, you >might as well just compare char*s. This is where we disagree. 
If you are the caller you know at compile-time how much you want to match; I think comparing 2 or 3 size-t with no looping is a lot better (a fully-unrolled, 64-bit per instruction strcmp with one of the operands known to the compiler...). > >If one is concerned about memory, one could force the sigcharptr to be >aligned, and then the "keys" could be either sigcharptr or key >depending on whether the least significant bit was set. One could >easily scan for/switch on a key and scanning for a char* would be >almost as easy (just don't dereference if the lsb is set). > >I don't see us being memory constrained, so > >(version,nspecs,futureuse),(key,sigcharptr,funcptr)*,optionalsigchardata* > >seems fine to me even if only one of key/sigchrptr is ever used per >spec. Null-terminating the specs would work fine as well (one less >thing to keep track of during iteration). Well, can't one always use more L1 cache, or is that not a concern? If you have 5-6 different routines calling each other using this mechanism, each with multiple specializations, those unused slots translate to many cache lines wasted. I don't think it is that important, I just think that how pretty the C struct declaration ends up looking should not be a concern at all, when the whole point of this is speed anyway. You can always just use a throwaway struct declaration and a cast to get whatever layout you need. If the 'padding' leads to less branching then fine, but I don't see that it helps in any way. To refine my proposal a bit, we have a list of variable size entries, (keydata, keydata, ..., funcptr) where each keydata and the ptr is 64 bits on all platforms (see below); each entry must have a total length multiple of 128 bits (so that one can safely scan for a signature in 128 bit increments in the data *without* parsing or branching, you'll never hit a pointer), and each key but the first starts with a 'dash'. Signature strings are either kept separate, or even parsed/decoded from the keys. 
We really only care about speed when you have compiled or JITed code for the case; decoding should be fine otherwise. BTW, won't the Cython-generated C code be a horrible mess if we use size_t rather than insist on int64_t? (OK, those need some ifdefs for various compilers, but still seem cleaner than operating with 32-bit and 64-bit keys, and stdint.h is winning ground). Dag -- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
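Dag's refined layout can be modeled in Python to check that the scanning logic works: keys are 64-bit words, continuation words start with '-', each entry is padded to a multiple of 128 bits, and the caller scans in 128-bit strides comparing whole words. Everything here (the chunking, the dash convention, the fake function-pointer values) is a sketch of the proposal, not a reference implementation; in this toy model the stride alone keeps the scan off pointer words.

```python
import struct

def make_keys(sig):
    """Split sig into 8-byte little-endian words; continuation words
    carry a leading '-' so they can't be mistaken for a first word."""
    chunks = [sig[:8]] + ['-' + sig[i:i + 7] for i in range(8, len(sig), 7)]
    return [struct.unpack('<Q', c.encode().ljust(8, b'\0'))[0] for c in chunks]

def make_table(specs):
    """specs: list of (sig, funcptr). Flatten into 64-bit words, with
    each (keys..., funcptr) entry padded to a multiple of two words."""
    words = []
    for sig, funcptr in specs:
        entry = make_keys(sig) + [funcptr]
        if len(entry) % 2:
            entry.append(0)            # pad to a 128-bit boundary
        words.extend(entry)
    return words

def lookup(words, sig, stride=2):
    """Scan in 128-bit strides; no parsing, no NUL checks, just word
    compares, as in 'safely scan for a signature in 128 bit increments'."""
    keys = make_keys(sig)
    for i in range(0, len(words), stride):
        if words[i:i + len(keys)] == keys:
            return words[i + len(keys)]    # the funcptr word
    return None

table = make_table([("d)d", 0x1000), ("ffiddffidd)d", 0x2000)])
assert lookup(table, "d)d") == 0x1000
assert lookup(table, "ffiddffidd)d") == 0x2000
assert lookup(table, "ff)f") is None
```

The point of the fixed stride is exactly what Dag describes: a caller that knows the (possibly multi-word) key it wants can scan unrolled, two words per step, without ever branching on string contents.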
Re: [Cython] CEP1000: Native dispatch through callables
Robert Bradshaw wrote: >On Fri, Apr 13, 2012 at 2:24 PM, Nathaniel Smith wrote: >> On Fri, Apr 13, 2012 at 9:27 PM, Dag Sverre Seljebotn >> wrote: >>> Ah, I didn't think about 6-bit or huffman. Certainly helps. >>> >>> I'm almost +1 on your proposal now, but a couple of more ideas: >>> >>> 1) Let the key (the size_t) spill over to the next specialization >entry if >>> it is too large; and prepend that key with a continuation code (two >size-ts >>> could together say "iii)-d\0\0" on 32 bit systems with 8bit >encoding, using >>> - as continuation). The key-based caller will expect a continuation >if it >>> knows about the specialization, and the prepended char will prevent >spurios >>> matches against the overspilled slot. >>> >>> We could even use the pointers for part of the continuation... >> >> I am really lost here. Why is any of this complicated encoding stuff >> better than interning? Interning takes one line of code, is >incredibly >> cheap (one dict lookup per call site and function definition), and it >> lets you check any possible signature (even complicated ones >involving >> memoryviews) by doing a single-word comparison. And best of all, you >> don't have to think hard to make sure you got the encoding right. ;-) >> >> On a 32-bit system, pointers are smaller than a size_t, but more >> expressive! You can still do binary search if you want, etc. Is the >> problem just that interning requires a runtime calculation? Because I >> feel like C users (like numpy) will want to compute these compressed >> codes at module-init anyway, and those of us with a fancy compiler >> capable of computing them ahead of time (like Cython) can instruct >> that fancy compiler to compute them at module-init time just as >> easily? > >Good question. > >The primary disadvantage of interning that I see is memory locality. I >suppose if all the C-level caches of interned values were co-located, >this may not be as big of an issue. 
Not being able to compare against >compile-time constants may thwart some optimization opportunities, but >that's less clear. > >It also requires coordination common repository, but I suppose one >would just stick a set in some standard module (or leverage Python's >interning). More problems: 1) It doesn't work well with multiple interpreter states. Ok, nothing works with that at the moment, but it is on the roadmap for Python and we should not make it worse. You basically *need* a thread safe store separate from any python interpreter; though pythread.h does not rely on the interpreter state; which helps. 2) you end up with the known comparison values in read-write memory segments rather than readonly segments, which is probably worse on multicore systems? I really think that anything that we can do to make this near-c-speed should be done; none of the proposals are *that* complicated. Using keys, NumPy can in the C code choose to be slower but more readable; but using interned string forces cython to be slower, cython gets no way of choosing to go faster. (to the degree that it has an effect; none of these claims were checked) Dag > >- Robert >___ >cython-devel mailing list >cython-devel@python.org >http://mail.python.org/mailman/listinfo/cython-devel -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. ___ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel
Re: [Cython] CEP1000: Native dispatch through callables
Dag Sverre Seljebotn wrote: > > >Robert Bradshaw wrote: > >>On Fri, Apr 13, 2012 at 2:24 PM, Nathaniel Smith >wrote: >>> On Fri, Apr 13, 2012 at 9:27 PM, Dag Sverre Seljebotn >>> wrote: Ah, I didn't think about 6-bit or huffman. Certainly helps. I'm almost +1 on your proposal now, but a couple of more ideas: 1) Let the key (the size_t) spill over to the next specialization >>entry if it is too large; and prepend that key with a continuation code (two >>size-ts could together say "iii)-d\0\0" on 32 bit systems with 8bit >>encoding, using - as continuation). The key-based caller will expect a continuation >>if it knows about the specialization, and the prepended char will prevent >>spurios matches against the overspilled slot. We could even use the pointers for part of the continuation... >>> >>> I am really lost here. Why is any of this complicated encoding stuff >>> better than interning? Interning takes one line of code, is >>incredibly >>> cheap (one dict lookup per call site and function definition), and >it >>> lets you check any possible signature (even complicated ones >>involving >>> memoryviews) by doing a single-word comparison. And best of all, you >>> don't have to think hard to make sure you got the encoding right. >;-) >>> >>> On a 32-bit system, pointers are smaller than a size_t, but more >>> expressive! You can still do binary search if you want, etc. Is the >>> problem just that interning requires a runtime calculation? Because >I >>> feel like C users (like numpy) will want to compute these compressed >>> codes at module-init anyway, and those of us with a fancy compiler >>> capable of computing them ahead of time (like Cython) can instruct >>> that fancy compiler to compute them at module-init time just as >>> easily? >> >>Good question. >> >>The primary disadvantage of interning that I see is memory locality. I >>suppose if all the C-level caches of interned values were co-located, >>this may not be as big of an issue. 
Not being able to compare against >>compile-time constants may thwart some optimization opportunities, but >>that's less clear. >> >>It also requires coordination common repository, but I suppose one >>would just stick a set in some standard module (or leverage Python's >>interning). > >More problems: > >1) It doesn't work well with multiple interpreter states. Ok, nothing >works with that at the moment, but it is on the roadmap for Python and >we should not make it worse. > >You basically *need* a thread safe store separate from any python >interpreter; though pythread.h does not rely on the interpreter state; >which helps. No, it doesn't, unless we want to ship a single(!) .so-file that can be depended upon by all relevant projects. There's just no way for loaded modules to communicate and synchronize that they know about this CEP except through an interpreter... That's almost impossible to work around in any clean way? (I can think of several very ugly ones...) Unless the multiple interpreter state idea is entirely dead in CPython, interning must be done seperately for each interpreter and the values stored in the module object. Ugh. Dag > >2) you end up with the known comparison values in read-write memory >segments rather than readonly segments, which is probably worse on >multicore systems? > >I really think that anything that we can do to make this near-c-speed >should be done; none of the proposals are *that* complicated. > >Using keys, NumPy can in the C code choose to be slower but more >readable; but using interned string forces cython to be slower, cython >gets no way of choosing to go faster. (to the degree that it has an >effect; none of these claims were checked) > >Dag > > >> >>- Robert >>___ >>cython-devel mailing list >>cython-devel@python.org >>http://mail.python.org/mailman/listinfo/cython-devel > >-- >Sent from my Android phone with K-9 Mail. Please excuse my brevity. 
-- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
Re: [Cython] CEP1000: Native dispatch through callables
On Fri, Apr 13, 2012 at 3:06 PM, Dag Sverre Seljebotn wrote: > > > Robert Bradshaw wrote: > >>On Fri, Apr 13, 2012 at 1:27 PM, Dag Sverre Seljebotn >> wrote: >>> Ah, I didn't think about 6-bit or huffman. Certainly helps. >> >>Yeah, we don't want to complicate the ABI too much, but I think >>something like 8 4-bit common chars and 32 6-bit other chars (or 128 >>8-bit other chars) wouldn't be outrageous. The fact that we only have >>to encode into a single word makes the algorithm very simple (though >>the majority of the time we'd spit out pre-encoded literals). We have >>a version number to play with this as well. >> >>> I'm almost +1 on your proposal now, but a couple of more ideas: >>> >>> 1) Let the key (the size_t) spill over to the next specialization >>entry if >>> it is too large; and prepend that key with a continuation code (two >>size-ts >>> could together say "iii)-d\0\0" on 32 bit systems with 8bit encoding, >>using >>> - as continuation). The key-based caller will expect a continuation >>if it >>> knows about the specialization, and the prepended char will prevent >>spurios >>> matches against the overspilled slot. >>> >>> We could even use the pointers for part of the continuation... >>> >>> 2) Separate the char* format strings from the keys, ie this memory >>layout: >>> >>> >>Version,nslots,nspecs,funcptr,key,funcptr,key,...,sigcharptr,sigcharptr... >>> >>> Where nslots is larger than nspecs if there are continuations. >>> >>> OK, this is getting close to my original proposal, but the difference >>is the >>> contiunation char, so that if you expect a short signature, you can >>safely >>> scan every slot and branching and no null-checking necesarry. >> >>I don't think we need nslots (though it might be interesting). My >>thought is that once you start futzing with variable-length keys, you >>might as well just compare char*s. > > This is where we disagree. 
> If you are the caller, you know at compile time how much you want to
> match; I think comparing 2 or 3 size_ts with no looping is a lot
> better (a fully unrolled, 64-bit-per-instruction strcmp with one of
> the operands known to the compiler...).

Doesn't the compiler unroll strcmp much like this for a known operand?

>> If one is concerned about memory, one could force the sigcharptr to
>> be aligned, and then the "keys" could be either sigcharptr or key
>> depending on whether the least significant bit was set. One could
>> easily scan for/switch on a key, and scanning for a char* would be
>> almost as easy (just don't dereference if the lsb is set).
>>
>> I don't see us being memory constrained, so
>>
>> (version,nspecs,futureuse),(key,sigcharptr,funcptr)*,optionalsigchardata*
>>
>> seems fine to me even if only one of key/sigcharptr is ever used per
>> spec. Null-terminating the specs would work fine as well (one less
>> thing to keep track of during iteration).
>
> Well, can't one always use more L1 cache, or is that not a concern?
> If you have 5-6 different routines calling each other using this
> mechanism, each with multiple specializations, those unused slots
> translate to many cache lines wasted.
>
> I don't think it is that important, I just think that how pretty the
> C struct declaration ends up looking should not be a concern at all,
> when the whole point of this is speed anyway. You can always just use
> a throwaway struct declaration and a cast to get whatever layout you
> need. If the 'padding' leads to less branching then fine, but I don't
> see that it helps in any way.

I was more concerned about guaranteeing each char* was aligned.
> To refine my proposal a bit, we have a list of variable-size entries,
>
> (keydata, keydata, ..., funcptr)
>
> where each keydata and the ptr is 64 bits on all platforms (see
> below); each entry must have a total length that is a multiple of 128
> bits (so that one can safely scan for a signature in 128-bit
> increments in the data *without* parsing or branching; you'll never
> hit a pointer), and each key but the first starts with a 'dash'.

Ah, OK, similar to UTF-8. Yes, I like this idea.

> Signature strings are either kept separate, or even parsed/decoded
> from the keys. We really only care about speed when you have compiled
> or JITed code for the case; decoding should be fine otherwise.

True.

> BTW, won't the Cython-generated C code be a horrible mess if we use
> size_t rather than insist on int64_t? (OK, those need some ifdefs for
> various compilers, but they still seem cleaner than operating with
> 32-bit and 64-bit keys, and stdint.h is winning ground.)

Sure, we could require 64-bit keys (and pointer slots).

On Fri, Apr 13, 2012 at 3:22 PM, Dag Sverre Seljebotn wrote:
>>> I am really lost here. Why is any of this complicated encoding
>>> stuff better than interning? Interning takes one line of code, is
>>> incredibly cheap (one dict lookup per call site and function
>>> definition), and it lets you check any possible signature (even
>>> complicated ones involving memoryviews
Re: [Cython] CEP1000: Native dispatch through callables
On Fri, Apr 13, 2012 at 11:22 PM, Dag Sverre Seljebotn wrote:
>
> Robert Bradshaw wrote:
>
>> On Fri, Apr 13, 2012 at 2:24 PM, Nathaniel Smith wrote:
>>> On Fri, Apr 13, 2012 at 9:27 PM, Dag Sverre Seljebotn wrote:
>>>> Ah, I didn't think about 6-bit or huffman. Certainly helps.
>>>>
>>>> I'm almost +1 on your proposal now, but a couple more ideas:
>>>>
>>>> 1) Let the key (the size_t) spill over to the next specialization
>>>> entry if it is too large, and prepend that key with a continuation
>>>> code (two size_ts could together say "iii)-d\0\0" on 32-bit
>>>> systems with 8-bit encoding, using - as the continuation). The
>>>> key-based caller will expect a continuation if it knows about the
>>>> specialization, and the prepended char will prevent spurious
>>>> matches against the overspilled slot.
>>>>
>>>> We could even use the pointers for part of the continuation...
>>>
>>> I am really lost here. Why is any of this complicated encoding
>>> stuff better than interning? Interning takes one line of code, is
>>> incredibly cheap (one dict lookup per call site and function
>>> definition), and it lets you check any possible signature (even
>>> complicated ones involving memoryviews) by doing a single-word
>>> comparison. And best of all, you don't have to think hard to make
>>> sure you got the encoding right. ;-)
>>>
>>> On a 32-bit system, pointers are smaller than a size_t, but more
>>> expressive! You can still do binary search if you want, etc. Is the
>>> problem just that interning requires a runtime calculation? Because
>>> I feel like C users (like numpy) will want to compute these
>>> compressed codes at module-init anyway, and those of us with a
>>> fancy compiler capable of computing them ahead of time (like
>>> Cython) can instruct that fancy compiler to compute them at
>>> module-init time just as easily?
>>
>> Good question.
>>
>> The primary disadvantage of interning that I see is memory locality.
>> I suppose if all the C-level caches of interned values were
>> co-located, this may not be as big of an issue. Not being able to
>> compare against compile-time constants may thwart some optimization
>> opportunities, but that's less clear.

I would like to see some demonstration of this. E.g., you can run this:

echo -e '#include <string.h>\nint main(int argc, char ** argv) { return strcmp(argv[0], "a"); }' | gcc -S -x c - -o - -O2 | less

Looks to me like for a short, known-at-compile-time string, with
optimization on, gcc implements it by basically sticking the string in
a global variable and then using a pointer... (If I do
argv[0] == (char *)0x1234, then it places the constant value directly
into the instruction stream. Strangely enough, it does *not* inline
the constant value even if I do memcmp(&argv[0], "\1\2\3\4", 4), which
should be exactly equivalent...!)

I think gcc is just as likely to stick a bunch of

  static void * interned_dd_to_d;
  static void * interned_ll_to_l;

next to each other in the memory image as it is to stick a bunch of
equivalent manifest constants. If you're worried, make it

  static void * interned_signatures[NUM_SIGNATURES]

-- then they'll definitely be next to each other.

>> It also requires coordination around a common repository, but I
>> suppose one would just stick a set in some standard module (or
>> leverage Python's interning).
>
> More problems:
>
> 1) It doesn't work well with multiple interpreter states. OK, nothing
> works with that at the moment, but it is on the roadmap for Python
> and we should not make it worse.

This isn't a criticism, but I'd like to see a reference to the work in
this direction! My impression was that it's been on the roadmap for
maybe a decade, in a really desultory fashion:
http://docs.python.org/faq/library.html#can-t-we-get-rid-of-the-global-interpreter-lock
So if it's actually happening that's quite interesting.
> You basically *need* a thread-safe store separate from any Python
> interpreter; though pythread.h does not rely on the interpreter
> state, which helps.

Anyway, yes, if you can't rely on the interpreter then you'd need some
place to store the intern table, but I'm not sure why this would be a
problem (in Python 3.6 or whenever it becomes relevant).

> 2) You end up with the known comparison values in read-write memory
> segments rather than read-only segments, which is probably worse on
> multicore systems?

Is it? Can you elaborate? Cache ping-ponging is certainly bad, but
that's when multiple cores are writing to the same cache line; I can't
see how the TLB flags would matter. I guess the problem would be if
you also have some other data in the global variable space that you
write to constantly, and it turned out it was placed next to these
read-only comparison values in the same cache line?

> I really think that anything we can do to make this near-C-speed
> should be done; none of the proposals are *that* complicated.

I agree, but I object to codifying the waving of dead chickens. :-)

> Using keys,
Re: [Cython] CEP1000: Native dispatch through callables
Dag Sverre Seljebotn wrote:
> 1) It doesn't work well with multiple interpreter states. Ok, nothing
> works with that at the moment, but it is on the roadmap for Python

Is it really? I got the impression that it's not considered feasible,
since it would require massive changes to the entire implementation
and totally break the existing C API. Has someone thought of a way
around those problems?

--
Greg