Re: [Cython] memoryview slices can't be None?
On Sat, Feb 4, 2012 at 11:39 AM, Dag Sverre Seljebotn wrote:
>> Block-local declarations are definitely something we want, although I
>> think it would require some more (non-trivial) changes to the
>> compiler.
>
> Note that my proposal was actually not about block-local declarations.
>
> Block-local:
>
>     {
>         int x = 4;
>     }
>     /* x not available here */
>
> My idea was much more like hints to control flow analysis. That is, I
> wanted this to raise an error:
>
>     x = 'adf'
>     if foo():
>         cdef int x = y
>     print x  # type of x not known
>
> This is OK:
>
>     if foo():
>         cdef int x = y
>     else:
>         cdef int x = 4
>     print x  # ok, type the same anyway -- so type "escapes" block
>
> And I would allow:
>
>     cdef str x = y
>     if foo():
>         cdef int x = int(x)
>         return g(x)  # x must be int
>     print x  # x must be str at this point
>
> The reason for this madness is simply that control statements do NOT
> create blocks in Python, and making it so in Cython is just confusing.
> It would bring too much of C into the language for my taste.

I think the above examples (especially the last one) are a bit confusing
as well. Introducing the notion of (implicit) block scoping is not very
Pythonic. We would need something like it to support local cdef classes,
but I think a with statement is more appropriate for that, as there is a
notion of doing non-trivial work when exiting the block.

> I think that in my Cython-utopia, Symtab.py is only responsible for
> resolving the scope of *names*, and types of things are not bound to
> blocks, just to the state at control flow points.
>
> Of course, implementing this would be a nightmare.
>
>> Maybe the cleanup code from functions, as well as the temp handling
>> etc., could be refactored to a BlockNode that all block nodes could
>> subclass. They'd have to instantiate new symbol table environments as
>> well. I'm not yet entirely sure what else would be involved in the
>> implementation of that.
>>
>>> But I like int[:] as a way of making it pure Python syntax compatible
>>> as well. Perhaps the two are orthogonal -- a) make variable
>>> declaration a statement, b) make cython.int[:](x) do, essentially, a
>>> cdef declaration, for Python compatibility.
>>
>> Don't we have cython.declare() for that? e.g.
>>
>>     arr = cython.declare(cython.int[:])
>>
>> That would also be treated as a statement, like normal declarations
>> (if and when implemented).
>
> This was what I said, but it wasn't what I meant. Sorry. I'll try to
> explain better:
>
> 1) There's no way to have the above actually do the right thing in
> Python. With "arr = cython.int[:](arr)" one could actually return a
> NumPy or NumPy-like array that works in Python (since "arr" might not
> have the "shape" attribute before the conversion, all we know is that
> it exports the buffer interface...).
>
> 2) I don't like the fact that we overload the assignment operator to
> acquire a view. "cdef np.ndarray[int] x = y" is fine, since if you do
> "x.someattr" then a NumPy subclass could provide someattr and it works
> fine. Acquiring a view is just something different.
>
> 3) Hence I guess I like "arr = int[:](arr)" better, both for Cython and
> Python; at least if "arr" is always type-inferred to be int[:], even if
> arr was an "object" further up in the code. (Really, if you do
> "x = f(x)" at the top level of the function, then x can just take the
> identity of another variable from that point on -- I don't know whether
> the current control flow analysis and type inference does this,
> though?)
> Dag Sverre
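To make the alternatives concrete, here is a minimal sketch of the two
spellings discussed above, in Cython's pure-Python mode. cython.declare()
is the existing construct; the constructor-style cython.int[:](arr)
conversion is only a proposal in this thread, so it appears as a comment.
This assumes memoryview types are usable from pure-Python mode, which is
exactly what the thread is asking for; function and variable names are
illustrative:

    import cython

    def sum_ints(arr):
        # Existing spelling: declare the type; the assignment acquires
        # a memoryview on whatever buffer "arr" exports.
        view = cython.declare(cython.int[:], arr)

        # Proposed (hypothetical) spelling from this thread: an
        # explicit conversion instead of an overloaded assignment:
        #     view = cython.int[:](arr)

        total = cython.declare(cython.longlong, 0)
        for i in range(view.shape[0]):
            total += view[i]
        return total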
Re: [Cython] OpenCL support
On Tue, Feb 7, 2012 at 9:58 AM, Sturla Molden wrote:
> On 07.02.2012 18:22, Dimitri Tcaciuc wrote:
>
>> I'm not sure I understand you, maybe you could elaborate on that?
>
> OpenCL code is a text string that is compiled when the program runs. So
> it can be generated from run-time data. Think of it like dynamic HTML.
>
>> Again, not sure what you mean here. As I mentioned in the thread,
>> PyOpenCL worked quite fine, however if Cython is getting OpenCL
>> support, I'd much rather use that than keep a dependency on another
>> library.
>
> You can use PyOpenCL or the OpenCL C or C++ headers with Cython. The
> latter you just use as you would any other C or C++ library. You don't
> need to change the compiler to use a library: it seems like you think
> OpenCL is compiled from code when you build the program. It is actually
> compiled from text strings when you run the program. It is meaningless
> to ask whether Cython supports OpenCL, because Cython supports any C
> library.

I view this more as a proposal to have an OpenCL backend for prange
loops and other vectorized operations. The advantage of integrating
OpenCL into Cython is that one can write a single implementation of an
algorithm (using traditional for...(p)range loops) and have it use the
GPU in the background transparently, without having to learn and call
the library manually. This is analogous to the compiler/runtime system
deciding to use SSE instructions for a portion of your code because it
thinks that will be faster.

I really like the idea of decoupling the logic of the algorithm from the
SIMD implementation (which is one of the reasons that prange, and in
part OpenMP, works so well), but I think this is best done at the
language level in our case. Whether OpenCL is mature enough, the
abstractions are clean enough, and the heuristics can be good enough to
pull this off is another question, but it would be great if it can be
done (ideally with minimal impact to the language and isolated changes
to the internals).

- Robert
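For reference, this is the kind of loop Robert describes, as it is
written today with cython.parallel (a sketch; function and variable
names are illustrative). As written it compiles to an OpenMP parallel
loop; under the proposal the same source could be dispatched to an
OpenCL device instead:

    from cython.parallel import prange

    def saxpy(float[:] y, float[:] x, float a):
        cdef Py_ssize_t i
        # Today this runs on OpenMP threads; the CEP would let the
        # runtime execute it on a GPU with no change to the source.
        for i in prange(y.shape[0], nogil=True):
            y[i] = a * x[i] + y[i]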
Re: [Cython] OpenCL support
On 02/05/2012 10:57 PM, mark florisson wrote:
> Hey,
>
> I created a CEP for OpenCL support:
> http://wiki.cython.org/enhancements/opencl
> What do you think?

To start with my own conclusion on this, my feeling is that it is too
little gain, at least for a GPU solution. There's already Theano for
trivial SIMD stuff and PyOpenCL for the getting-hands-dirty stuff. (Of
course, this CEP would be more convenient to use than Theano if one is
already using Cython.)

But that's just my feeling, and I'm not the one potentially signing up
to do the work, so whether it is "worth it" is really not my decision;
the weighing is done with your weights, not mine. Given an
implementation, I definitely support the inclusion in Cython of this
kind of feature (FWIW).

First, CPU:

OpenCL is probably a very good way of portably making use of SSE/AVX
etc. But to really get a payoff, I would think the real value would be
in *not* using OpenCL vector types -- just many threads -- so that the
OpenCL driver does the dirty work of mapping each thread to each slot in
the CPU registers. I'd think the gain in using OpenCL is to emit scalar
code and leave the dirty work to OpenCL. If one does the hard part and
maps variables to vectors and memory accesses to shuffles, one might as
well go the whole length and emit SSE/AVX rather than OpenCL, to avoid
the startup overhead.

I don't really know how good the Intel and AMD CPU drivers are w.r.t.
this -- I have seen the Intel driver emit "vectorizing" and "could not
vectorize", but didn't explore the circumstances.

Then, on to GPU:

It is not a general-purpose solution; you still need to bring in
PyOpenCL for lots of cases, and so the question is how many cases it
fits and whether that is enough to grow a userbase around it. And,
importantly, how much performance is sacrificed for the resulting
user-friendliness. A 50% performance hit is usually OK, 95% maybe not.
And a 95% hit is not unimaginable if the memory movement is done in a
bad way for some code.

I think the fundamental problem is one of programming paradigms.
Fortran, C++, Cython are all sequential in nature; even with OpenMP it
is like you have a modest bit of parallelism tacked on to speed up a
sequential-looking program. With "massively parallel" solutions such as
CUDA and OpenCL, and also MPI in fact, the fundamental assumption is
that you have thousands or hundreds of thousands of threads. And that
just changes how you need to think about writing code, which tends to
show up at the syntax level. So, at least if you want good performance,
you need to change your way of thinking enough that a new syntax
(loosely cooperating threads rather than a parallel for-loop or SIMD
instruction) is actually an advantage, as it keeps you reminded of how
the hardware works.

So I think the most important thing to do (if you bother) is: gather a
set of real-world(-ish) CUDA or OpenCL programs, port them to Cython +
this CEP (without a working Cython implementation for it), and see how
that goes. That's really the only way to evaluate it.

Some experiences from the single instance of GPU code I've written:

- For starters, I had to give up OpenCL and use CUDA in order to use all
48 KB of the shared memory available on NVidia compute capability 2.0
(perhaps I just didn't find the OpenCL option for that). And increasing
from 16 to 48 KB allowed a fundamentally faster and qualitatively
different algorithm to be used. But OpenCL vs. CUDA is kind of beside
the point here.

- When mucking about with various "obvious" ports of sequential code to
GPU code, I got performance in the range of 5 to 20 GFLOP/s (out of
490 GFLOP/s or so theoretical; NVidia Tesla M2050). When really
understanding the hardware, and making good use of the 48 KB of
thread-shared memory, I achieved 209 GFLOP/s, without really doing any
micro-optimization. I don't think the CEP includes any features for
inter-thread communication, so that's off the table.

(My code is here:
https://github.com/wavemoth/wavemoth/blob/cuda/wavemoth/cuda/legendre_transform.cu.in
Though it's badly documented and rush-for-deadline quality; I plan to
polish it up and publish it when I get time in autumn.)

I guess I mention this as the kind of computation your CEP definitely
does NOT cover. That's probably OK, but one should figure out
specifically how many use cases it does cover (in particular with no
control over thread blocks and intra-block communication). Is the CEP an
80% solution, or a 10% solution?

Dag Sverre
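For contrast, this is what the getting-hands-dirty PyOpenCL route
mentioned above looks like, with the kernel as a plain text string
compiled at run time (Sturla's "dynamic HTML" point earlier in the
thread). A minimal sketch; the kernel and buffer names are illustrative:

    import numpy as np
    import pyopencl as cl

    a = np.arange(1 << 20, dtype=np.float32)

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

    # The kernel is an ordinary string, compiled when the program runs,
    # so it could just as well be generated from run-time data.
    prg = cl.Program(ctx, """
        __kernel void scale(__global float *a, const float c) {
            int gid = get_global_id(0);
            a[gid] = c * a[gid];
        }
    """).build()

    prg.scale(queue, a.shape, None, a_buf, np.float32(2.0))
    cl.enqueue_copy(queue, a, a_buf)  # read the result back to the host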
Re: [Cython] OpenCL support
On Wed, Feb 8, 2012 at 6:46 AM, Dag Sverre Seljebotn wrote:
> On 02/05/2012 10:57 PM, mark florisson wrote:
>
> I don't really know how good the Intel and AMD CPU drivers are w.r.t.
> this -- I have seen the Intel driver emit "vectorizing" and "could not
> vectorize", but didn't explore the circumstances.

For our project, we've tried both the Intel and AMD (previously ATI)
backends. The AMD experience somewhat mirrors what this developer
described (http://www.msoos.org/2012/01/amds-opencl-heaven-and-hell/),
although not as bad in terms of silent failures (or maybe I just haven't
caught any!).

The Intel backend was great and clearly better in terms of performance,
sometimes by about 20-30%. However, when run on an older AMD-based
machine as opposed to an Intel one, the resulting kernel simply
segfaulted without any warning about an unsupported architecture (I
think it's because it didn't have SSE3 support).

I know Intel is working with the LLVM/Clang folks to introduce their
vectorization additions, at least to some degree, and LLVM seems to be
consistently improving in this regard (e.g.
http://blog.llvm.org/2011/12/llvm-31-vector-changes.html). I suppose if
Cython emitted vectorization-friendly numerical loops, then an
appropriate C/C++ compiler should take care of this automatically, if
used. Intel C++ can already do certain stuff like that (see
http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/),
and GCC as well AFAIK.

Dimitri.
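A sketch of what "vectorization-friendly" means in practice: with typed,
contiguous memoryviews, Cython lowers the loop below to a plain C loop
over unit-stride pointers, the pattern that GCC (with -O3) or Intel C++
auto-vectorizers recognize. Function and variable names are
illustrative:

    def axpy(double[::1] y, double[::1] x, double a):
        # The ::1 declares C-contiguity, so the generated C indexes a
        # flat array with unit stride instead of multiplying by a
        # run-time stride, which would defeat the vectorizer.
        cdef Py_ssize_t i
        for i in range(y.shape[0]):
            y[i] += a * x[i]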
Re: [Cython] OpenCL support
On 8 February 2012 14:46, Dag Sverre Seljebotn wrote:
> On 02/05/2012 10:57 PM, mark florisson wrote:
>>
>> Hey,
>>
>> I created a CEP for OpenCL support:
>> http://wiki.cython.org/enhancements/opencl
>> What do you think?
>
> To start with my own conclusion on this, my feeling is that it is too
> little gain, at least for a GPU solution. There's already Theano for
> trivial SIMD stuff and PyOpenCL for the getting-hands-dirty stuff. (Of
> course, this CEP would be more convenient to use than Theano if one is
> already using Cython.)

Yes, vector operations and elemental or reduction functions operating on
vectors (which is what we can use Theano for, right?) don't quite merit
the use of OpenCL. However, the upside is that OpenCL allows easier
vectorization and multi-threading. We can appeal to auto-vectorizing
compilers, but e.g. using OpenMP for multithreading will still segfault
the program if used outside the main thread with GCC's implementation; I
believe Intel allows you to use it in any thread. (Of course, keeping a
thread pool around and managing it manually isn't too hard, but...)

> But that's just my feeling, and I'm not the one potentially signing up
> to do the work, so whether it is "worth it" is really not my decision;
> the weighing is done with your weights, not mine. Given an
> implementation, I definitely support the inclusion in Cython of this
> kind of feature (FWIW).
>
> First, CPU:
>
> OpenCL is probably a very good way of portably making use of SSE/AVX
> etc. But to really get a payoff, I would think the real value would be
> in *not* using OpenCL vector types -- just many threads -- so that the
> OpenCL driver does the dirty work of mapping each thread to each slot
> in the CPU registers. I'd think the gain in using OpenCL is to emit
> scalar code and leave the dirty work to OpenCL. If one does the hard
> part and maps variables to vectors and memory accesses to shuffles, one
> might as well go the whole length and emit SSE/AVX rather than OpenCL,
> to avoid the startup overhead.
>
> I don't really know how good the Intel and AMD CPU drivers are w.r.t.
> this -- I have seen the Intel driver emit "vectorizing" and "could not
> vectorize", but didn't explore the circumstances.

I initially thought the same thing; single kernel invocations should be
trivially auto-vectorizable, one would think. At least with Apple OpenCL
I am getting better performance with vector types on the CPU, though (up
to 35%). I would personally consider emitting vector data types bonus
points.

But I don't quite agree that emitting SSE or AVX directly would be
almost as easy in that case. You'd still have to detect at runtime which
instruction set is supported and generate SSE, SSE2, (SSE4?) and AVX.
And that's not even all of them :) The OpenCL drivers just hide that
pain. With handwritten code you might be coding for a specific
architecture and might be fine with only SSE2, but as a compiler we
can't really make that same decision.

> Then, on to GPU:
>
> It is not a general-purpose solution; you still need to bring in
> PyOpenCL for lots of cases, and so the question is how many cases it
> fits and whether that is enough to grow a userbase around it. And,
> importantly, how much performance is sacrificed for the resulting
> user-friendliness. A 50% performance hit is usually OK, 95% maybe not.
> And a 95% hit is not unimaginable if the memory movement is done in a
> bad way for some code.

Yes, I don't expect this to change a lot suddenly.
In the long term I think the implementation could be sufficiently good
to support at least most codes. And the user still has full control over
data movement, if wanted (the pinning thing, which isn't mentioned in
the CEP).

> I think the fundamental problem is one of programming paradigms.
> Fortran, C++, Cython are all sequential in nature; even with OpenMP it
> is like you have a modest bit of parallelism tacked on to speed up a
> sequential-looking program. With "massively parallel" solutions such as
> CUDA and OpenCL, and also MPI in fact, the fundamental assumption is
> that you have thousands or hundreds of thousands of threads. And that
> just changes how you need to think about writing code, which tends to
> show up at the syntax level. So, at least if you want good performance,
> you need to change your way of thinking enough that a new syntax
> (loosely cooperating threads rather than a parallel for-loop or SIMD
> instruction) is actually an advantage, as it keeps you reminded of how
> the hardware works.
>
> So I think the most important thing to do (if you bother) is: gather a
> set of real-world(-ish) CUDA or OpenCL programs, port them to Cython +
> this CEP (without a working Cython implementation for it), and see how
> that goes. That's really the only way to evaluate it.

I've been wanting to do that for a long time now, also to evaluate the
capabilities of cython.parallel as it stands now. It's a really good
idea; I'll try to port some codes, and not just the trivial ones like
Jacobi's method.
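For concreteness, one of the trivial ports mentioned here: a Jacobi
sweep under the existing cython.parallel machinery (a sketch; array and
function names are illustrative). The CEP's open question is how far
beyond such flat, regular loops one can get without exposing thread
blocks and intra-block shared memory:

    from cython.parallel import prange

    def jacobi_sweep(double[:, ::1] u, double[:, ::1] out):
        cdef Py_ssize_t i, j
        cdef Py_ssize_t n = u.shape[0], m = u.shape[1]
        # Each output row depends only on the read-only input, so the
        # outer loop parallelizes (or, under the CEP, could map onto
        # GPU threads) directly.
        for i in prange(1, n - 1, nogil=True):
            for j in range(1, m - 1):
                out[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j]
                                    + u[i, j - 1] + u[i, j + 1])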
Re: [Cython] OpenCL support
On 8 February 2012 17:35, Dimitri Tcaciuc wrote:
> On Wed, Feb 8, 2012 at 6:46 AM, Dag Sverre Seljebotn wrote:
>> On 02/05/2012 10:57 PM, mark florisson wrote:
>>
>> I don't really know how good the Intel and AMD CPU drivers are w.r.t.
>> this -- I have seen the Intel driver emit "vectorizing" and "could not
>> vectorize", but didn't explore the circumstances.
>
> For our project, we've tried both the Intel and AMD (previously ATI)
> backends. The AMD experience somewhat mirrors what this developer
> described (http://www.msoos.org/2012/01/amds-opencl-heaven-and-hell/),
> although not as bad in terms of silent failures (or maybe I just
> haven't caught any!).
>
> The Intel backend was great and clearly better in terms of performance,
> sometimes by about 20-30%. However, when run on an older AMD-based
> machine as opposed to an Intel one, the resulting kernel simply
> segfaulted without any warning about an unsupported architecture (I
> think it's because it didn't have SSE3 support).
>
> I know Intel is working with the LLVM/Clang folks to introduce their
> vectorization additions, at least to some degree, and LLVM seems to be
> consistently improving in this regard (e.g.
> http://blog.llvm.org/2011/12/llvm-31-vector-changes.html). I suppose if
> Cython emitted vectorization-friendly numerical loops, then an
> appropriate C/C++ compiler should take care of this automatically, if
> used. Intel C++ can already do certain stuff like that (see
> http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/),
> and GCC as well AFAIK.

Indeed, native C (hopefully auto-vectorized whenever possible) is what
we also hope to use (depending on heuristics). But what it doesn't give
you is multithreading for the CPU (and e.g. Grand Central Dispatch on
OS X).

> Dimitri.
Re: [Cython] OpenCL support
On 02/08/2012 11:11 PM, mark florisson wrote:
> On 8 February 2012 14:46, Dag Sverre Seljebotn wrote:
> [...]
>> I don't really know how good the Intel and AMD CPU drivers are w.r.t.
>> this -- I have seen the Intel driver emit "vectorizing" and "could not
>> vectorize", but didn't explore the circumstances.
>
> I initially thought the same thing; single kernel invocations should be
> trivially auto-vectorizable, one would think. At least with Apple
> OpenCL I am getting better performance with vector types on the CPU,
> though (up to 35%). I would personally consider emitting vector data
> types bonus points.
>
> But I don't quite agree that emitting SSE or AVX directly would be
> almost as easy in that case. You'd still have to detect at runtime
> which instruction set is supported and generate SSE, SSE2, (SSE4?) and
> AVX. And that's not even all of them :) The OpenCL drivers just hide
> that pain. With handwritten code you might be coding for a specific
> architecture and might be fine with only SSE2, but as a compiler we
> can't really make that same decision.

You make good points.

> [...]