On 8 February 2012 23:28, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
> On 02/09/2012 12:15 AM, Dag Sverre Seljebotn wrote:
>> On 02/08/2012 11:11 PM, mark florisson wrote:
>>> On 8 February 2012 14:46, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
>>>> On 02/05/2012 10:57 PM, mark florisson wrote:
>>>>> Hey,
>>>>>
>>>>> I created a CEP for OpenCL support: http://wiki.cython.org/enhancements/opencl
>>>>> What do you think?
>>>>
>>>> To start with my own conclusion on this, my feeling is that it is too little gain, at least for a GPU solution. There's already Theano for trivial SIMD stuff and PyOpenCL for the getting-hands-dirty stuff. (Of course, this CEP would be more convenient to use than Theano if one is already using Cython.)
>>>
>>> Yes, vector operations and elemental or reduction functions operating on vectors (which is what we can use Theano for, right?) don't quite merit the use of OpenCL. However, the upside is that OpenCL allows easier vectorization and multi-threading. We can appeal to auto-vectorizing compilers, but e.g. using OpenMP for multithreading will still segfault the program if used outside the main thread with gcc's implementation; I believe Intel allows you to use it in any thread. (Of course, keeping a thread pool around and managing it manually isn't too hard, but...)
>>>
>>>> But that's just my feeling, and I'm not the one potentially signing up to do the work, so whether it is "worth it" is really not my decision; the weighing is done with your weights, not mine. Given an implementation, I definitely support the inclusion of these kinds of features in Cython (FWIW).
>>>>
>>>> First, CPU:
>>>>
>>>> OpenCL is probably a very good way of portably making use of SSE/AVX etc. But to really get a payoff, I would think that the real value would be in *not* using OpenCL vector types, just many threads, so that the OpenCL driver does the dirty work of mapping each thread to each slot in the CPU registers? I'd think the gain in using OpenCL is to emit scalar code and leave the dirty work to OpenCL. If one does the hard part and mapped variables to vectors and memory accesses to shuffles, one might as well go the whole length and emit SSE/AVX rather than OpenCL to avoid the startup overhead.
>>>>
>>>> I don't really know how good the Intel and AMD CPU drivers are w.r.t. this -- I have seen the Intel driver emit "vectorizing" and "could not vectorize", but didn't explore the circumstances.
>>>
>>> I initially thought the same thing; single kernel invocations should be trivially auto-vectorizable, one would think. At least with Apple OpenCL I am getting better performance with vector types on the CPU (up to 35%), though. I would personally consider emitting vector data types bonus points.
>>>
>>> But I don't quite agree that emitting SSE or AVX directly would be almost as easy in that case. You'd still have to detect at runtime which instruction set is supported and generate SSE, SSE2, (SSE4?) and AVX. And that's not even all of them :) The OpenCL drivers just hide that pain. With handwritten code you might be coding for a specific architecture and might be fine with only SSE2, but as a compiler we can't really make that same decision.
>>
>> You make good points.
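To make the CPU case concrete: what I have in mind is just today's cython.parallel syntax, where a (for now hypothetical) OpenCL backend would lower the scalar loop body to a kernel and let the driver worry about picking SSE/SSE2/AVX. A purely illustrative sketch:

    from cython.parallel import prange

    def saxpy(float[:] x, float[:] y, float a):
        # each iteration is independent, so the backend could map one
        # iteration to one work item and emit purely scalar kernel code
        cdef Py_ssize_t i
        for i in prange(x.shape[0], nogil=True):
            y[i] = a * x[i] + y[i]

Whether the driver then vectorizes that well is of course exactly the open question above.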
>>>> Then, on to GPU:
>>>>
>>>> It is not a general-purpose solution; you still need to bring in pyopencl for lots of cases, and so the question is how many cases it fits and whether that is enough to grow a userbase around it. And, importantly, how much performance is sacrificed for the resulting user-friendliness. A 50% performance hit is usually OK, 95% maybe not. And a 95% hit is not unimaginable if the memory movement is done in a bad way for some code?
>>>
>>> Yes, I don't expect this to change a lot suddenly. In the long term I think the implementation could be sufficiently good to support at least most codes. And the user still has full control over data movement, if wanted (the pinning thing, which isn't mentioned in the CEP).
>>>
>>>> I think the fundamental problem is one of programming paradigms. Fortran, C++ and Cython are all sequential in nature; even with OpenMP it is like you have a modest bit of parallelism tacked on to speed up a sequential-looking program. With "massively parallel" solutions such as CUDA and OpenCL, and also MPI in fact, the fundamental assumption is that you have thousands or hundreds of thousands of threads. And that just changes how you need to think about writing code, which would tend to show up at the syntax level. So, at least if you want good performance, you need to change your way of thinking enough that a new syntax (loosely cooperating threads rather than a parallel for loop or SIMD instruction) is actually an advantage, as it keeps you reminded of how the hardware works.
>>>>
>>>> So I think the most important thing to do (if you bother) is: gather a set of real-world(-ish) CUDA or OpenCL programs, port them to Cython + this CEP (without a working Cython implementation for it), and see how that goes. That's really the only way to evaluate it.
>>>
>>> I've been wanting to do that for a long time now, also to evaluate the capabilities of cython.parallel as it stands now. It's a really good idea; I'll try to port some codes, and not just the trivial ones like Jacobi's method :).
>>>
>>>> Some experiences from the single-instance GPU code I've written:
>>>>
>>>> - For starters, I had to give up OpenCL and use CUDA to use all of the 48 KB of shared memory available on Nvidia compute capability 2.0 (perhaps I just didn't find the OpenCL option for that). And increasing from 16 to 48 KB allowed a fundamentally faster and qualitatively different algorithm to be used. But OpenCL vs. CUDA is kind of beside the point here....
>>>>
>>>> - When mucking about with various "obvious" ports of sequential code to GPU code, I got performance in the range of 5 to 20 GFLOP/s (out of 490 GFLOP/s or so theoretical; NVidia Tesla M2050). When really understanding the hardware, and making good use of the 48 KB of thread-shared memory, I achieved 209 GFLOP/s, without really doing any micro-optimization. I don't think the CEP includes any features for inter-thread communication, so that's off the table.
>>>
>>> The CEP doesn't mention barriers (discussed earlier), but they should be supported, and __local memory (that's "shared memory" in CUDA terms, right?) could be utilized using a more explicit scheme (or implicitly if the compiler is smart). The only issue with barriers is that with OpenCL you have multiple levels of synchronization: barriers only work within the work group / thread block, whereas with OpenMP they simply work for all your threads. I think a global barrier would have to mean kernel termination + the start of a new one, which could be hard to support depending on where it is placed in the code...
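(To make that kernel-splitting point concrete with something that already runs today: a global barrier in the middle of a parallel section would in effect have to compile to the pattern below, i.e. two prange loops, which the hypothetical OpenCL backend would turn into two separate kernel invocations with the implicit barrier between them.)

    from cython.parallel import prange

    def smooth(double[:] src, double[:] tmp, double[:] dst):
        cdef Py_ssize_t i, n = src.shape[0]
        # first "kernel": every work item writes its own tmp[i]
        for i in prange(1, n - 1, nogil=True):
            tmp[i] = 0.5 * (src[i - 1] + src[i + 1])
        # a global barrier here corresponds to a kernel boundary
        # second "kernel": now safe to read any tmp[i] written above
        for i in prange(1, n - 1, nogil=True):
            dst[i] = 0.5 * (src[i] + tmp[i])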
>>>> (My code is here: https://github.com/wavemoth/wavemoth/blob/cuda/wavemoth/cuda/legendre_transform.cu.in -- though it's badly documented and rush-for-deadline quality; I plan to polish it up and publish it when I get time in autumn.)
>>>>
>>>> I guess I mention this as the kind of computation your CEP definitely does NOT cover. That's probably OK, but one should figure out specifically how many use cases it does cover (in particular with no control over thread blocks and intra-block communication). Is the CEP an 80% solution, or a 10% solution?
>>>
>>> I haven't looked too carefully at the code, but a large portion is dedicated to a reduction, right? What I don't see is how your reduction spans multiple work groups / thread blocks, because __syncthreads should only sync stuff within a single block. The CEP
>
> Most of the time there's actually no explicit synchronization; the code relies on all threads of a warp being on the same instruction in the scheduler. __syncthreads is then only used at the end of the reduction, when all within-warp additions have been done. Calling __syncthreads at each step of the algorithm would have totally killed performance.
>
> Dag Sverre

Ah, clever. I don't think there's any way to figure out the warp size with OpenCL, but maybe if the user specifies it in some way, similar optimizations can be made.
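At the Cython level, the scalar case that would have to be mapped onto that kind of hardware reduction is already expressible today -- inplace operators on a scalar inside prange are recognized as reductions -- so the question is only how cleverly a (still hypothetical) OpenCL backend could lower something like:

    from cython.parallel import prange

    def dot(double[:] x, double[:] y):
        cdef Py_ssize_t i
        cdef double total = 0.0
        # 'total' is a prange reduction variable; an OpenCL backend would
        # have to turn this into a per-work-group reduction plus a final
        # combine step (e.g. another kernel invocation)
        for i in prange(x.shape[0], nogil=True):
            total += x[i] * y[i]
        return total

Array reductions are messier, as discussed below.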
>> There's no need to reduce across thread blocks because (conveniently enough) there are 8000 independent computations to be performed with different parameters. I simply used one thread block for each problem.
>>
>> It's basically a matrix-vector product where the matrix must be generated on the fly column-wise (one entry can be generated from the preceding two in the same column), but the summation is row-wise.
>>
>> And it turns out that getting inter-thread sum reduction to work well was harder than I expected; a 32-by-32 matrix (needed since warps are 32 threads) is too big to fit in memory, but tree reduction makes a lot of the threads in a warp do nothing. So I ended up with a hybrid approach; there's a visual demo from page 49 onwards here: http://folk.uio.no/dagss/talk-gpusht.pdf
>>
>> Getting back to Cython, I'll admit that this form of inter-thread reduction is quite generic, and that my specific problem could be solved by basically coding a set of inter-thread reduction algorithms suitable for different hardware into Cython.
>>
>>> didn't mention reductions, but they should be supported (I'm thinking multi-stage or sequential within the work group, whichever works better, followed by another kernel invocation if the result is needed).
>>
>> Multiple kernel invocations for global barriers appear to be pretty standard, and it's why OpenCL supports queueing tasks with dependencies etc.
>>
>>> As mentioned earlier in a different thread (on parallelism, I think), reduction arrays (i.e. memoryviews or C arrays) as well as generally private arrays should be supported. An issue with that is that you can't really dedicate an array to each work item / thread (too much memory would be consumed).
>>>
>>> Again, declarations within blocks would solve many problems:
>>>
>>>     cdef float[n] shared_by_work_group
>>>     with parallel():
>>>         cdef float[n] local_to_work_group
>>>         for i in prange(...):
>>>             cdef float[n] local_to_work_item
>>>
>>> For arrays, the reductions could be somewhat more explicit, with an explicit 'my_memoryview += my_local_scratch_data'. That should probably only be allowed for memory local to the work group.
>>>
>>> Anyway, I'll try porting some numerical codes to this scheme over the coming weeks and see what is missing and how it can be solved. I still believe it can all be made to work quite properly, without adjusting the language to fit the hardware model. The prange (and OpenMP) model looks like sequential code, but it tells the compiler a lot, namely that each iteration is independent and could therefore be scheduled as a separate thread.
>>
>> Again, good points.
>>
>> Dag

_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel