On 02/09/2012 12:15 AM, Dag Sverre Seljebotn wrote:
On 02/08/2012 11:11 PM, mark florisson wrote:
On 8 February 2012 14:46, Dag Sverre Seljebotn
<d.s.seljeb...@astro.uio.no> wrote:
On 02/05/2012 10:57 PM, mark florisson wrote:

Hey,

I created a CEP for OpenCL support:
http://wiki.cython.org/enhancements/opencl
What do you think?


To start with my own conclusion on this, my feeling is that it is too little
gain, at least for a GPU solution. There's already Theano for trivial SIMD
stuff and PyOpenCL for the getting-your-hands-dirty stuff. (Of course, this
CEP would be more convenient to use than Theano if one is already using
Cython.)

Yes, vector operations and elemental or reduction functions operating on
vectors (which is what we can use Theano for, right?) don't quite merit the
use of OpenCL. However, the upside is that OpenCL allows easier vectorization
and multi-threading. We can appeal to auto-vectorizing compilers, but e.g.
using OpenMP for multithreading will still segfault the program if used
outside the main thread with gcc's implementation; I believe Intel's
implementation allows you to use it from any thread. (Of course, keeping a
thread pool around and managing it manually isn't too hard, but...)
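
For illustration, a minimal sketch of that scenario (hypothetical module and
caller; whether it actually crashes depends on the OpenMP runtime, not on
Cython itself):

# in a .pyx module compiled with OpenMP enabled (e.g. -fopenmp)
from cython.parallel import prange

def scale(double[:] x, double factor):
    cdef Py_ssize_t i
    # prange starts an OpenMP thread team; the claim above is that libgomp
    # has been unreliable when this is first entered off the main thread
    for i in prange(x.shape[0], nogil=True):
        x[i] = x[i] * factor

# from ordinary Python (after importing the compiled module), deliberately
# running the loop in a worker thread rather than the main thread:
import threading, numpy as np
threading.Thread(target=scale, args=(np.ones(10**6), 2.0)).start()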

But that's just my feeling, and I'm not the one potentially signing up to do
the work, so whether it is "worth it" is really not my decision; the weighing
is done with your weights, not mine. Given an implementation, I definitely
support the inclusion of this kind of feature in Cython (FWIW).

First, CPU:

OpenCL is probably a very good way of portably making use of SSE/AVX etc. But
to really get a payoff, I would think the real value lies in *not* using
OpenCL vector types, just many threads, so that the OpenCL driver does the
dirty work of mapping each thread to a slot in the CPU's vector registers?
I'd think the gain in using OpenCL is to emit scalar code and leave the dirty
work to OpenCL. If one does the hard part and maps variables to vectors and
memory accesses to shuffles, one might as well go the whole length and emit
SSE/AVX rather than OpenCL, to avoid the startup overhead.

I don't really know how good the Intel and AMD CPU drivers are w.r.t. this --
I have seen the Intel driver emit "vectorizing" and "could not vectorize",
but didn't explore the circumstances.


I initially thought the same thing; one would think single kernel invocations
should be trivially auto-vectorizable. At least with Apple's OpenCL I am
getting better performance with vector types on the CPU, though (up to 35%).
I would personally consider emitting vector data types bonus points.

But I don't quite agree that emitting SSE or AVX directly would be almost as
easy in that case. You'd still have to detect at runtime which instruction
set is supported and generate SSE, SSE2, (SSE4?) and AVX, and that's not even
all of them :). The OpenCL drivers just hide that pain. With handwritten code
you might be coding for a specific architecture and might be fine with only
SSE2, but as a compiler we can't really make that same decision.
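
For concreteness, a minimal PyOpenCL sketch of the scalar-code approach
(hypothetical kernel; whether a CPU driver actually packs adjacent work-items
into SSE/AVX lanes is entirely up to the implementation):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# one scalar element per work-item; the driver is free to vectorize
# across work-items itself
src = """
__kernel void axpy(__global const float *x, __global float *y, float a)
{
    int i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}
"""
prg = cl.Program(ctx, src).build()

x = np.random.rand(1 << 20).astype(np.float32)
y = np.random.rand(1 << 20).astype(np.float32)
mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
y_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=y)

prg.axpy(queue, x.shape, None, x_buf, y_buf, np.float32(2.0))
cl.enqueue_copy(queue, y, y_buf)

The vector-type variant would use float4 arguments inside the kernel instead;
that's where the ~35% improvement I mentioned came from in my tests.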

You make good points.


Then, on to GPU:

It is not a general-purpose solution; you still need to bring in pyopencl for
lots of cases, and so the question is how many cases it fits and whether that
is enough to grow a userbase around it. And, importantly, how much performance
is sacrificed for the resulting user-friendliness. A 50% performance hit is
usually OK, 95% maybe not. And a 95% hit is not unimaginable if the memory
movement is done in a bad way for some code?

Yes, I don't expect this to change a lot suddenly. In the long term I think
the implementation could become good enough to support at least most codes.
And the user still has full control over data movement, if wanted (the
pinning thing, which isn't mentioned in the CEP).

I think the fundamental problem is one of programming paradigms. Fortran, C++
and Cython are all sequential in nature; even with OpenMP it is like you have
a modest bit of parallelism tacked on to speed up a sequential-looking
program. With "massively parallel" solutions such as CUDA and OpenCL, and
also MPI in fact, the fundamental assumption is that you have thousands or
hundreds of thousands of threads. And that just changes how you need to think
about writing code, which tends to show up at the syntax level. So, at least
if you want good performance, you need to change your way of thinking enough
that a new syntax (loosely cooperating threads rather than a parallel
for-loop or SIMD instructions) is actually an advantage, as it keeps you
reminded of how the hardware works.

So I think the most important thing to do (if you bother) is: gather a set of
real-world(-ish) CUDA or OpenCL programs, port them to Cython + this CEP
(without a working Cython implementation for it), and see how that goes.
That's really the only way to evaluate it.

I've been wanting to do that for a long time now, also to evaluate the
capabilities of cython.parallel as it stands now. It's a really good idea;
I'll try to port some codes, and not just the trivial ones like Jacobi's
method :).

Some experiences from the single instance GPU code I've written:

- For starters, I had to give up OpenCL and use CUDA in order to use all of
the 48 KB of shared memory available on Nvidia compute-capability-2.0
hardware (perhaps I just didn't find the OpenCL option for that). And
increasing from 16 to 48 KB allowed a fundamentally faster and qualitatively
different algorithm to be used. But OpenCL vs. CUDA is kind of beside the
point here....

- When mucking about with various "obvious" ports of sequential code to GPU
code, I got performance in the range of 5 to 20 GFLOP/s (out of roughly 490
GFLOP/s theoretical; NVidia Tesla M2050). When I really understood the
hardware and made good use of the 48 KB of thread-shared memory, I achieved
209 GFLOP/s, without really doing any micro-optimization. I don't think the
CEP includes any features for inter-thread communication, so that's off the
table.

The CEP doesn't mention barriers (discussed earlier), but they should be
supported, and __local memory (that's "shared memory" in CUDA terms, right?)
could be utilized through a more explicit scheme (or implicitly, if the
compiler is smart). The only issue with barriers is that OpenCL has multiple
levels of synchronization: barriers only work within a work-group / thread
block, whereas with OpenMP a barrier simply applies to all of your threads.
I think a global barrier would have to mean kernel termination plus the start
of a new kernel, which could be hard to support depending on where it is
placed in the code...
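
As a sketch of that distinction (hypothetical kernel, the standard
tree-reduction idiom, assuming a power-of-two work-group size): barrier()
only synchronizes the work-items of one work-group around the __local array,
and the per-group results would have to be combined by a second kernel
launch.

partial_sum_src = """
__kernel void partial_sums(__global const float *x,
                           __global float *partial,  /* one value per work-group */
                           __local float *scratch)
{
    int lid = get_local_id(0);
    scratch[lid] = x[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);               /* work-group-wide only */

    for (int step = get_local_size(0) / 2; step > 0; step /= 2) {
        if (lid < step)
            scratch[lid] += scratch[lid + step];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}
"""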

(My code is here:

https://github.com/wavemoth/wavemoth/blob/cuda/wavemoth/cuda/legendre_transform.cu.in

though it's badly documented and rush-for-deadline quality; I plan to polish
it up and publish it when I get time in autumn).

I guess I mention this as the kind of computation your CEP definitely does
NOT cover. That's probably OK, but one should figure out specifically how
many use cases it does cover (in particular with no control over thread
blocks and intra-block communication). Is the CEP an 80% solution, or a 10%
solution?

I haven't looked too carefully at the code, but a large portion is dedicated
to a reduction, right? What I don't see is how your reduction spans multiple
work-groups / thread blocks, because __syncthreads should only sync stuff
within a single block. The CEP

Most of the time there's actually no explicit synchronization; the code relies on all threads of a warp being at the same instruction in the scheduler. __syncthreads is then only used at the end of the reduction, when all within-warp additions have been done. Calling __syncthreads at each step of the algorithm would have totally killed performance.

Dag Sverre


There's no need to reduce across thread blocks because (conveniently enough)
there are 8000 independent computations to be performed with different
parameters. I simply used one thread block for each problem.

It's basically a matrix-vector product where the matrix must be generated on
the fly column-wise (one entry can be generated from the preceding two in the
same column), but the summation is row-wise.

And it turns out that getting inter-thread sum reduction to work well was
harder than I expected; a 32-by-32 matrix (needed since warps are 32 threads)
is too big to fit in memory, but tree reduction makes a lot of the threads in
a warp do nothing. So I ended up with a hybrid approach; there's a visual
demo from page 49 onwards here:

http://folk.uio.no/dagss/talk-gpusht.pdf

Getting back to Cython, I'll admit that this form of inter-thread
reduction is quite generic, and that my specific problem could be solved
by basically coding a set of inter-thread reduction algorithms suitable
for different hardware into Cython.

didn't mention reductions, but they should be supported (I'm thinking
multi-stage or sequential within the work-group, whichever works better,
followed by another kernel invocation if the result is needed).

Multiple kernel invocations for global barriers appear to be pretty standard,
and it's why OpenCL supports queueing tasks with dependencies etc.
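
For example, a minimal pyopencl sketch of that pattern (hypothetical kernels):
the two launches below are the "stages", and the inter-kernel dependency is
what acts as the global barrier.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

prg = cl.Program(ctx, """
__kernel void stage1(__global float *d) { d[get_global_id(0)] += 1.0f; }
__kernel void stage2(__global float *d) { d[get_global_id(0)] *= 2.0f; }
""").build()

data = np.zeros(1024, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=data)

# no work-item of stage2 starts before every work-item of stage1 has finished
evt1 = prg.stage1(queue, data.shape, None, buf)
evt2 = prg.stage2(queue, data.shape, None, buf, wait_for=[evt1])
cl.enqueue_copy(queue, data, buf, wait_for=[evt2])

(With the default in-order queue the ordering is implicit anyway; the explicit
events are what you'd rely on with an out-of-order queue or multiple queues.)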

As mentioned earlier in a different thread (on parallelism, I think),
reduction arrays (i.e. memoryviews or C arrays), as well as private arrays in
general, should be supported. An issue with that is that you can't really
dedicate an array to each work item / thread (too much memory would be
consumed).

Again, declarations within blocks would solve many problems:

cdef float[n] shared_by_work_group
with parallel():
    cdef float[n] local_to_work_group
    for i in prange(...):
        cdef float[n] local_to_work_item

For arrays, the reductions could be somewhat more explicit, with an explicit
'my_memoryview += my_local_scratch_data'. That should probably only be
allowed for memory local to the work group.

Anyway, I'll try porting some numerical codes to this scheme over the coming
weeks and see what is missing and how it can be solved. I still believe it
can all be made to work quite properly, without adjusting the language to fit
the hardware model. The prange (and OpenMP) model looks like sequential code,
but it tells the compiler a lot, namely that each iteration is independent
and can therefore be scheduled as a separate thread.
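
For instance, this already works with cython.parallel today (a trivial
sketch; compiled with OpenMP enabled): the loop body reads sequentially, but
prange declares the iterations independent, and the in-place += on a scalar
is picked up as a reduction.

from cython.parallel import prange

def dot(double[:] a, double[:] b):
    cdef Py_ssize_t i
    cdef double total = 0.0
    # iterations are declared independent; 'total' becomes an OpenMP reduction
    for i in prange(a.shape[0], nogil=True):
        total += a[i] * b[i]
    return total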

Again, good points.

Dag
_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel
