Re: [Cython] memoryview slices can't be None?

2012-02-08 Thread Robert Bradshaw
On Sat, Feb 4, 2012 at 11:39 AM, Dag Sverre Seljebotn
 wrote:
>>
>> Block-local declarations are definitely something we want, although I
>> think it would require some more (non-trivial) changes to the
>> compiler.
>
>
> Note that my proposal was actually not about block-local declarations.
>
> Block-local:
>
> {
>   int x = 4;
> }
> /* x not available here */
>
> My idea was much more like hints to control flow analysis. That is, I wanted
> to have this raise an error:
>
> x = 'adf'
> if foo():
>    cdef int x = y
> print x # type of x not known
>
> This is OK:
>
> if foo():
>    cdef int x = y
> else:
>    cdef int x = 4
> print x # ok, type the same anyway -- so type "escapes" block
>
> And I would allow
>
> cdef str x = y
> if foo:
>    cdef int x = int(x)
>    return g(x) # x must be int
> print x # x must be str at this point
>
>
> The reason for this madness is simply that control statements do NOT create
> blocks in Python, and making it so in Cython is just confusing. It would
> bring too much of C into the language for my taste.

I think the above examples (especially the last one) are a bit
confusing as well. Introducing the notion of (implicit) block scoping
is not very Pythonic. We would need something to be able to support
local cdef classes, but I think a with statement is more appropriate
for that as there's a notion of doing non-trivial work when exiting
the block.

> I think that in my Cython-utopia, Symtab.py is only responsible for
> resolving the scope of *names*, and types of things are not bound to blocks,
> just to the state at control flow points.
>
> Of course, implementing this would be a nightmare.
>
>
>> Maybe the cleanup code from functions, as well as the temp handling
>> etc. could be refactored into a BlockNode that all block nodes could
>> subclass. They'd have to instantiate new symbol table environments as
>> well. I'm not yet entirely sure what else would be involved in the
>> implementation of that.
>>
>>> But I like int[:] as a way of making it pure Python syntax compatible as
>>> well. Perhaps the two are orthogonal -- a) make variable declaration a
>>> statement, b) make cython.int[:](x) do, essentially, a cdef declaration,
>>> for
>>> Python compatibility.
>>>
>>
>> Don't we have cython.declare() for that? e.g.
>>
>>     arr = cython.declare(cython.int[:])
>>
>> That would also be treated as a statement like normal declarations (if
>> and when implemented).
>
>
> This was what I said, but it wasn't what I meant. Sorry. I'll try to explain
> better:
>
> 1)  There's no way to have the above actually do the right thing in Python.
> With "arr = cython.int[:](arr)" one could actually return a NumPy or
> NumPy-like array that works in Python (since "arr" might not have the
> "shape" attribute before the conversion, all we know is that it exports the
> buffer interface...).
>
> 2) I don't like the fact that we overload the assignment operator to acquire
> a view. "cdef np.ndarray[int] x = y" is fine since if you do "x.someattr"
> then a NumPy subclass could provide someattr and it works fine. Acquiring a
> view is just something different.
>
> 3) Hence I guess I like "arr = int[:](arr)" better both for Cython and
> Python; at least if "arr" is always type-inferred to be int[:], even if arr
> was an "object" further up in the code (really, if you do "x = f(x)" at the
> top-level of the function, then x can just take the identity of another
> variable from that point on -- I don't know if the current control flow
> analysis and type inference does this though?)
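>
> For concreteness (a sketch only -- the call syntax is the proposal and
> is not implemented):
>
>     # today: assignment to a typed memoryview acquires the view
>     cdef int[:] mv = arr
>
>     # proposed: an explicit coercion call, also meaningful in pure
>     # Python mode; afterwards arr would be inferred as int[:]
>     # arr = cython.int[:](arr)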
>
>
> Dag Sverre


Re: [Cython] OpenCL support

2012-02-08 Thread Robert Bradshaw
On Tue, Feb 7, 2012 at 9:58 AM, Sturla Molden  wrote:
> On 07.02.2012 18:22, Dimitri Tcaciuc wrote:
>
>> I'm not sure I understand you, maybe you could elaborate on that?
>
>
> OpenCL code is a text string that is compiled when the program runs. So it
> can be generated from run-time data. Think of it like dynamic HTML.
>
>
>> Again, not sure what you mean here. As I mentioned in the thread,
>> PyOpenCL worked quite well; however, if Cython is getting OpenCL
>> support, I'd much rather use that than keep a dependency on another
>> library.
>
>
> You can use PyOpenCL, or the OpenCL C or C++ headers, with Cython. The
> latter you just use as you would any other C or C++ library; you don't
> need to change the compiler to use a library. It seems like you think
> OpenCL is compiled from code when you build the program, but it is
> actually compiled from text strings when you run the program. It is
> meaningless to ask whether Cython supports OpenCL, because Cython
> supports any C library.

I view this more as a proposal to have an OpenCL backend for prange
loops and other vectorized operations. The advantage of integrating
OpenCL into Cython is that one can write a single implementation of
an algorithm (using traditional for...(p)range loops) and have it
transparently use the GPU in the background (without having to learn
and call the library manually). This is analogous to the
compiler/runtime system deciding to use SSE instructions for a
portion of your code because it thinks they will be faster. I really
like the idea of decoupling the logic of the algorithm from the SIMD
implementation (which is one of the reasons that prange, and in part
OpenMP, works so well), but I think this is best done at the language
level in our case.
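
For concreteness, the kind of loop I mean is plain cython.parallel as
it works today; the OpenCL backend is the hypothetical part. A
minimal sketch:

    # saxpy.pyx -- runs via OpenMP today; an OpenCL backend could pick
    # up this same loop transparently.
    cimport cython
    from cython.parallel import prange

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def saxpy(float a, float[::1] x, float[::1] y):
        cdef Py_ssize_t i
        for i in prange(x.shape[0], nogil=True):
            y[i] = a * x[i] + y[i]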

Whether OpenCL is mature enough/the abstractions are clean enough/the
heuristics can be good enough to pull this off is another question,
but it'd be great if it could be done (ideally with minimal impact on
the language and isolated changes to the internals).

- Robert


Re: [Cython] OpenCL support

2012-02-08 Thread Dag Sverre Seljebotn

On 02/05/2012 10:57 PM, mark florisson wrote:

Hey,

I created a CEP for opencl support: http://wiki.cython.org/enhancements/opencl
What do you think?


To start with my own conclusion on this, my feel is that it is too 
little gain, at least for a GPU solution. There's already Theano for 
trivial SIMD-stuff and PyOpenCL for the getting-hands-dirty stuff. (Of 
course, this CEP would be more convenient to use than Theano if one is 
already using Cython.)


But that's just my feeling, and I'm not the one potentially signing up
to do the work, so whether it is "worth it" is really not my decision;
the weighing is done with your weights, not mine. Given an
implementation, I definitely support the inclusion in Cython of these
kinds of features (FWIW).


First, CPU:

OpenCL is probably a very good way of portably making use of SSE/AVX 
etc. But to really get a payoff, I would think that the real value
would be in *not* using OpenCL vector types, just many threads, so that 
the OpenCL driver does the dirty work of mapping each thread to each 
slot in the CPU registers? I'd think the gain in using OpenCL is to emit 
scalar code and leave the dirty work to OpenCL. If one does the hard 
part and maps variables to vectors and memory accesses to shuffles,
one might as well go the whole length and emit SSE/AVX rather than 
OpenCL to avoid the startup overhead.


I don't really know how good the Intel and AMD CPU drivers are w.r.t. 
this -- I have seen the Intel driver emit "vectorizing" and "could not 
vectorize", but didn't explore the circumstances.



Then, on to GPU:

It is not a general-purpose solution; you still need to bring in
pyopencl for lots of cases, and so the question is how many cases it 
fits with and if it is enough to grow a userbase around it. And, 
importantly, how much performance is sacrificed for the resulting 
user-friendliness. A 50% performance hit is usually OK, 95% maybe not. And
a 95% hit is not unimaginable if the memory movement is done in a bad 
way for some code?


I think the fundamental problem is one of programming paradigms. 
Fortran, C++, Cython are all sequential in nature; even with OpenMP it 
is like you have a modest bit of parallelism tacked on to speed up a 
sequential-looking program. With "massively parallel" solutions such as 
CUDA and OpenCL, and also MPI in fact, the fundamental assumption is that
you have thousands or hundreds of thousands of threads. And that just 
changes how you need to think about writing code, which would tend to 
show up at a syntax level. So, at least if you want good performance, 
you need to change your way of thinking enough that a new syntax 
(loosely cooperating threads rather than parallel-for-loop or SIMD 
instruction) is actually an advantage, as it keeps you reminded of how 
the hardware works.


So I think the most important thing to do (if you bother) is: Gather a 
set of real-world(-ish) CUDA or OpenCL programs, port them to Cython +
this CEP (without a working Cython implementation for it), and see how 
that goes. That's really the only way to evaluate it.


Some experiences from the one piece of GPU code I've written:

 - For starters, I had to give up OpenCL and use CUDA in order to use
all of the 48 KB of shared memory available on Nvidia
compute-capability-2.0 hardware (perhaps I just didn't find the OpenCL
option for that). And increasing from 16 to 48 KB allowed a
fundamentally faster and qualitatively different algorithm to be used.
But OpenCL vs. CUDA is kind of beside the point here.


 - When mucking about with various "obvious" ports of sequential code
to GPU code, I got performance in the range of 5 to 20 GFLOP/s (out of
a theoretical 490 GFLOP/s or so; NVidia Tesla M2050). Only when really
understanding the hardware, and making good use of the 48 KB of
thread-shared memory, did I achieve 209 GFLOP/s, without really doing
any micro-optimization. I don't think the CEP includes any features for
inter-thread communication, so that's off the table.


(My code is here:

https://github.com/wavemoth/wavemoth/blob/cuda/wavemoth/cuda/legendre_transform.cu.in

Though it's badly documented and of rush-for-deadline quality; I plan
to polish it up and publish it when I get time in the autumn).


I guess I mention this as the kind of computation your CEP definitely
does NOT cover. That's probably OK, but one should figure out
specifically how many use cases it does cover (in particular with no
control over thread blocks and intra-block communication). Is the CEP
an 80% solution, or a 10% solution?


Dag Sverre


Re: [Cython] OpenCL support

2012-02-08 Thread Dimitri Tcaciuc
On Wed, Feb 8, 2012 at 6:46 AM, Dag Sverre Seljebotn
 wrote:
> On 02/05/2012 10:57 PM, mark florisson wrote:
>
> I don't really know how good the Intel and AMD CPU drivers are w.r.t. this
> -- I have seen the Intel driver emit "vectorizing" and "could not
> vectorize", but didn't explore the circumstances.

For our project, we've tried both Intel and AMD (previously ATI)
backends. The AMD experience somewhat mirrors what this developer
described (http://www.msoos.org/2012/01/amds-opencl-heaven-and-hell/),
although not as bad in terms of silent failures (or maybe I just
haven't caught any!).

The Intel backend was great and clearly better in terms of performance,
sometimes by about 20-30%. However, when run on an older AMD-based
machine as opposed to an Intel one, the resulting kernel simply
segfaulted without any warning about an unsupported architecture (I
think it's because the machine didn't have SSE3 support).
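
(For anyone wanting to check which backend they are actually getting,
the platforms and devices are easy to enumerate from PyOpenCL; a
minimal sketch:)

    import pyopencl as cl

    for platform in cl.get_platforms():   # e.g. Intel OpenCL, AMD APP
        print("%s (%s)" % (platform.name, platform.version))
        for device in platform.get_devices():
            print("  %s" % device.name)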

>
> Dag Sverre
>


I know Intel is working with the LLVM/Clang folks to introduce their
vectorization additions, at least to some degree, and LLVM seems to be
consistently improving in this regard (e.g.
http://blog.llvm.org/2011/12/llvm-31-vector-changes.html). I suppose
that if Cython emitted vectorization-friendly numerical loops, an
appropriate C/C++ compiler would take care of this automatically, if
used. Intel C++ can already do certain things like that (see
http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/),
and GCC as well, AFAIK.
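
(Typed, contiguous memoryview loops already come close to that; a
minimal Cython sketch of the kind of loop an auto-vectorizer likes,
assuming bounds checking is disabled:)

    # vecadd.pyx
    cimport cython

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def vecadd(double[::1] a, double[::1] b, double[::1] out):
        cdef Py_ssize_t i
        for i in range(a.shape[0]):
            # stride-1 access over contiguous data compiles to a plain
            # C loop that icc/gcc can auto-vectorize
            out[i] = a[i] + b[i]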

Dimitri.


Re: [Cython] OpenCL support

2012-02-08 Thread mark florisson
On 8 February 2012 14:46, Dag Sverre Seljebotn
 wrote:
> On 02/05/2012 10:57 PM, mark florisson wrote:
>>
>> Hey,
>>
>> I created a CEP for opencl support:
>> http://wiki.cython.org/enhancements/opencl
>> What do you think?
>
>
> To start with my own conclusion on this, my feel is that it is too little
> gain, at least for a GPU solution. There's already Theano for trivial
> SIMD-stuff and PyOpenCL for the getting-hands-dirty stuff. (Of course, this
> CEP would be more convenient to use than Theano if one is already using
> Cython.)

Yes, vector operations and elemental or reduction functions operating
on vectors (which is what we can use Theano for, right?) don't quite
merit the use of OpenCL. However, the upside is that OpenCL allows
easier vectorization and multi-threading. We can appeal to
auto-vectorizing compilers, but e.g. using OpenMP for multithreading
will still segfault the program if used outside the main thread with
gcc's implementation. I believe Intel allows you to use it in any
thread. (Of course, keeping a thread pool around and managing it
manually isn't too hard, but...)

> But that's just my feeling, and I'm not the one potentially signing up to do
> the work, so whether it is "worth it" is really not my decision; the
> weighing is done with your weights, not mine. Given an implementation, I
> definitely support the inclusion in Cython of these kinds of features
> (FWIW).
>
> First, CPU:
>
> OpenCL is probably a very good way of portably making use of SSE/AVX etc.
> But to really get a payoff, I would think that the real value would be
> in *not* using OpenCL vector types, just many threads, so that the OpenCL
> driver does the dirty work of mapping each thread to each slot in the CPU
> registers? I'd think the gain in using OpenCL is to emit scalar code and
> leave the dirty work to OpenCL. If one does the hard part and maps
> variables to vectors and memory accesses to shuffles, one might as well go
> the whole length and emit SSE/AVX rather than OpenCL to avoid the startup
> overhead.
>
> I don't really know how good the Intel and AMD CPU drivers are w.r.t. this
> -- I have seen the Intel driver emit "vectorizing" and "could not
> vectorize", but didn't explore the circumstances.
>

I initially thought the same thing: one would think single kernel
invocations should be trivially auto-vectorizable. With Apple's OpenCL
on the CPU, though, I am getting better performance with vector types
(up to 35%). I would personally consider emitting vector data types
bonus points.
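
(By "vector types" I mean kernels written in terms of float4 etc.
rather than scalars; an illustrative fragment, assuming the array
length is a multiple of four:)

    src_vec = """
    __kernel void add4(__global const float4 *a,
                       __global const float4 *b,
                       __global float4 *out)
    {
        int i = get_global_id(0);
        out[i] = a[i] + b[i];   /* four elements per work-item */
    }
    """
    # launched with one quarter of the scalar version's global size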

But I don't quite agree that emitting SSE or AVX directly would be
almost as easy in that case. You'd still have to detect at runtime
which instruction set is supported and generate SSE, SSE2, (SSE4?) and
AVX. And that's not even all of them :) The OpenCL drivers just hide
that pain. With handwritten code you might be coding for a specific
architecture and might be fine with only SSE2, but as a compiler we
can't really make that same decision.

> Then, on to GPU:
>
> It is not a general-purpose solution; you still need to bring in pyopencl
> for lots of cases, and so the question is how many cases it fits with and if
> it is enough to grow a userbase around it. And, importantly, how much
> performance is sacrificed for the resulting user-friendliness. A 50%
> performance hit is usually OK, 95% maybe not. And a 95% hit is not
> unimaginable if the memory movement is done in a bad way for some code?

Yes, I don't expect this to change a lot suddenly. In the long term I
think the implementation could be sufficiently good to support at
least most codes. And the user still has full control over data
movement, if wanted (the pinning thing, which isn't mentioned in the
CEP).

> I think the fundamental problem is one of programming paradigms. Fortran,
> C++, Cython are all sequential in nature; even with OpenMP it is like you
> have a modest bit of parallelism tacked on to speed up a sequential-looking
> program. With "massively parallel" solutions such as CUDA and OpenCL, and
> also MPI in fact, the fundamental assumption is that you have thousands or
> hundreds of thousands of threads. And that just changes how you need to
> think about writing code, which would tend to show up at a syntax level. So,
> at least if you want good performance, you need to change your way of
> thinking enough that a new syntax (loosely cooperating threads rather than
> parallel-for-loop or SIMD instruction) is actually an advantage, as it keeps
> you reminded of how the hardware works.
>
> So I think the most important thing to do (if you bother) is: Gather a set
> of real-world(-ish) CUDA or OpenCL programs, port them to Cython + this CEP
> (without a working Cython implementation for it), and see how that goes.
> That's really the only way to evaluate it.

I've been wanting to do that for a long time now, also to evaluate the
capabilities of cython.parallel as it stands now. It's a really good
idea; I'll try to port some codes, and not just the trivial ones like
Jacobi's.
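
(For reference, the "trivial" kind of port meant here -- a Jacobi
sweep written against today's cython.parallel; a minimal sketch:)

    from cython.parallel import prange
    cimport cython

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def jacobi_sweep(double[:, ::1] u, double[:, ::1] out):
        cdef Py_ssize_t i, j
        for i in prange(1, u.shape[0] - 1, nogil=True):
            for j in range(1, u.shape[1] - 1):
                # four-point stencil: average of the neighbours
                out[i, j] = 0.25 * (u[i-1, j] + u[i+1, j] +
                                    u[i, j-1] + u[i, j+1])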

Re: [Cython] OpenCL support

2012-02-08 Thread mark florisson
On 8 February 2012 17:35, Dimitri Tcaciuc  wrote:
> On Wed, Feb 8, 2012 at 6:46 AM, Dag Sverre Seljebotn
>  wrote:
>> On 02/05/2012 10:57 PM, mark florisson wrote:
>>
>> I don't really know how good the Intel and AMD CPU drivers are w.r.t. this
>> -- I have seen the Intel driver emit "vectorizing" and "could not
>> vectorize", but didn't explore the circumstances.
>
> For our project, we've tried both Intel and AMD (previously ATI)
> backends. The AMD experience somewhat mirrors what this developer
> described (http://www.msoos.org/2012/01/amds-opencl-heaven-and-hell/),
> although not as bad in terms of silent failures (or maybe I just
> haven't caught any!).
>
> The Intel backend was great and clearly better in terms of performance,
> sometimes by about 20-30%. However, when run on an older AMD-based
> machine as opposed to an Intel one, the resulting kernel simply
> segfaulted without any warning about an unsupported architecture (I
> think it's because the machine didn't have SSE3 support).
>
>>
>> Dag Sverre
>>
>
>
> I know Intel is working with the LLVM/Clang folks to introduce their
> vectorization additions, at least to some degree, and LLVM seems to be
> consistently improving in this regard (e.g.
> http://blog.llvm.org/2011/12/llvm-31-vector-changes.html). I suppose
> that if Cython emitted vectorization-friendly numerical loops, an
> appropriate C/C++ compiler would take care of this automatically, if
> used. Intel C++ can already do certain things like that (see
> http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/),
> and GCC as well, AFAIK.

Indeed, native C (hopefully auto-vectorized whenever possible) is what
we also hope to use (depending on heuristics). But what it doesn't
give you is multithreading for the CPU (and e.g. Grand Central
Dispatch on OS X).

> Dimitri.


Re: [Cython] OpenCL support

2012-02-08 Thread Dag Sverre Seljebotn

On 02/08/2012 11:11 PM, mark florisson wrote:

On 8 February 2012 14:46, Dag Sverre Seljebotn
  wrote:

On 02/05/2012 10:57 PM, mark florisson wrote:


Hey,

I created a CEP for opencl support:
http://wiki.cython.org/enhancements/opencl
What do you think?



To start with my own conclusion on this, my feel is that it is too little
gain, at least for a GPU solution. There's already Theano for trivial
SIMD-stuff and PyOpenCL for the getting-hands-dirty stuff. (Of course, this
CEP would be more convenient to use than Theano if one is already using
Cython.)


Yes, vector operations and elemental or reduction functions operating
on vectors (which is what we can use Theano for, right?) don't quite
merit the use of OpenCL. However, the upside is that OpenCL allows
easier vectorization and multi-threading. We can appeal to
auto-vectorizing compilers, but e.g. using OpenMP for multithreading
will still segfault the program if used outside the main thread with
gcc's implementation. I believe Intel allows you to use it in any
thread. (Of course, keeping a thread pool around and managing it
manually isn't too hard, but...)


But that's just my feeling, and I'm not the one potentially signing up to do
the work, so whether it is "worth it" is really not my decision; the
weighing is done with your weights, not mine. Given an implementation, I
definitely support the inclusion in Cython of these kinds of features
(FWIW).

First, CPU:

OpenCL is probably a very good way of portably making use of SSE/AVX etc.
But to really get a payoff, I would think that the real value would be
in *not* using OpenCL vector types, just many threads, so that the OpenCL
driver does the dirty work of mapping each thread to each slot in the CPU
registers? I'd think the gain in using OpenCL is to emit scalar code and
leave the dirty work to OpenCL. If one does the hard part and maps
variables to vectors and memory accesses to shuffles, one might as well go
the whole length and emit SSE/AVX rather than OpenCL to avoid the startup
overhead.

I don't really know how good the Intel and AMD CPU drivers are w.r.t. this
-- I have seen the Intel driver emit "vectorizing" and "could not
vectorize", but didn't explore the circumstances.



I initially thought the same thing: one would think single kernel
invocations should be trivially auto-vectorizable. With Apple's OpenCL
on the CPU, though, I am getting better performance with vector types
(up to 35%). I would personally consider emitting vector data types
bonus points.

But I don't quite agree that emitting SSE or AVX directly would be
almost as easy in that case. You'd still have to detect at runtime
which instruction set is supported and generate SSE, SSE2, (SSE4?) and
AVX. And that's not even all of them :) The OpenCL drivers just hide
that pain. With handwritten code you might be coding for a specific
architecture and might be fine with only SSE2, but as a compiler we
can't really make that same decision.


You make good points.




Then, on to GPU:

It is not a general-purpose solution; you still need to bring in
for lots of cases, and so the question is how many cases it fits with and if
it is enough to grow a userbase around it. And, importantly, how much
performance is sacrificed for the resulting user-friendliness. A 50%
performance hit is usually OK, 95% maybe not. And a 95% hit is not
unimaginable if the memory movement is done in a bad way for some code?


Yes, I don't expect this to change a lot suddenly. In the long term I
think the implementation could be sufficiently good to support at
least most codes. And the user still has full control over data
movement, if wanted (the pinning thing, which isn't mentioned in the
CEP).


I think the fundamental problem is one of programming paradigms. Fortran,
C++, Cython are all sequential in nature; even with OpenMP it is like you
have a modest bit of parallelism tacked on to speed up a sequential-looking
program. With "massively parallel" solutions such as CUDA and OpenCL, and
also MPI in fact, the fundamental assumption is that you have thousands or
hundreds of thousands of threads. And that just changes how you need to
think about writing code, which would tend to show up at a syntax level. So,
at least if you want good performance, you need to change your way of
thinking enough that a new syntax (loosely cooperating threads rather than
parallel-for-loop or SIMD instruction) is actually an advantage, as it keeps
you reminded of how the hardware works.

So I think the most important thing to do (if you bother) is: Gather a set
of real worl(-ish) CUDA or OpenCL programs, port them to Cython + this CEP
(without a working Cython implementation for it), and see how that goes.
That's really the only way to evaluate it.


I've been wanting to do that for a long time now, also to evaluate the
capabilities of cython.parallel as it stands now. It's a really good
idea; I'll try to port some codes, and not just the trivial ones like
Jacobi's.
