There has been a lot of work on parallel prefix sums on GPUs. The difference between inclusive and exclusive prefix sum implementations is minor, and both can be implemented efficiently.

Focusing on work efficiency for GPUs is a red herring, especially when you're using a fully parallel recursive prefix sum, as you are doing. This is for two reasons:

1. GPUs are SIMD machines, so work efficiency as counted by the PRAM model does not always correlate with saving work on the actual hardware.

2. The work involved in synchronizing and coordinating threads is not counted as "work" in traditional analyses. On GPUs, floating-point operations are essentially free; communication and coordination between parallel threads, however, are very expensive.

Instead of doing a fully parallel recursive prefix sum, it's much more efficient to sequentialize the sum as much as possible, expressing only enough parallelism to fill the chip.
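To make the sequentialization point concrete, here is a host-side NumPy sketch (not GPU code, and not part of PyCUDA): each of p "workers" sums a contiguous chunk of the input serially, so only p partial results ever need cross-worker coordination. The function name and parameter are illustrative only.

```python
import numpy as np

def two_phase_sum(x, p=8):
    """Sketch of a sequentialized sum: each of p 'workers' serially
    sums a contiguous chunk; only the p partial sums need combining."""
    chunks = np.array_split(np.asarray(x), p)
    partials = [c.sum() for c in chunks]  # serial work, fully independent
    return sum(partials)                  # tiny combine step over p values

print(two_phase_sum(np.arange(1000)))  # 499500
```

On a GPU, the per-chunk loop runs inside each thread (or thread block) with no synchronization at all; only the final combine over p values needs any coordination.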
It's also advantageous to take advantage of commutativity: have you decided whether your scan implementation accepts associative but non-commutative operators? If you require operators to be both associative and commutative, you can improve performance significantly.

I wrote up some things about reductions (which are related to scan) here:
http://developer.amd.com/documentation/articles/pages/opencl-optimization-case-study-simple-reductions.aspx

As you can see, the "2-phase" reduction kernels are much faster than the "recursive multi-phase" kernels; your implementation uses the recursive multi-phase approach. There are two analogues of the 2-phase reduction kernel for a scan on an n-element array with p processors (thread blocks), both of which require exactly 3 kernel launches, regardless of n or p:

* Reduce, Scan, Scan: Divide the input into blocks of n/p elements. Perform reductions on these blocks in parallel and write the sum of each block to a temporary array. Perform a scan on the temporary array (this should require only 1 thread block and one kernel launch, since there are only p elements in the temporary array). Perform a scan on the input, this time starting each block with the appropriate result from the temporary scan.

* Scan, Scan, Add: Scan the blocks in parallel and write the results out. Perform a scan on the first result from each block (this should require only 1 thread block and one kernel launch, since there are only p blocks). Broadcast the appropriate result from the partial scan and perform a vector addition in each block.

If you sequentialize these operations as much as possible, and build parallel reduction/scan trees only when absolutely necessary, you will save a lot of synchronization and communication cost. You can take a look at Thrust [1] to see an example of how this can be done.
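The two 3-launch decompositions above can be sketched on the CPU with NumPy; each commented "launch" corresponds to one kernel launch in the GPU version. This is only an illustration of the structure, not GPU code, and the function names are mine. (In the Scan-Scan-Add sketch I compute the block totals with a reduction for clarity; on the GPU they fall out of the per-block scans.)

```python
import numpy as np

def exclusive_scan(a):
    """Reference exclusive prefix sum of a 1-D array."""
    c = np.cumsum(a)
    return np.concatenate(([0], c[:-1]))

def reduce_scan_scan(x, p=4):
    """Reduce, Scan, Scan: per-block reductions, one small scan of the
    p block sums, then per-block scans seeded with those offsets."""
    blocks = np.array_split(np.asarray(x), p)
    block_sums = np.array([b.sum() for b in blocks])  # launch 1: reduce
    offsets = exclusive_scan(block_sums)              # launch 2: scan p values
    return np.concatenate([off + exclusive_scan(b)    # launch 3: seeded scans
                           for off, b in zip(offsets, blocks)])

def scan_scan_add(x, p=4):
    """Scan, Scan, Add: per-block scans, one small scan of the block
    totals, then a broadcast add of each block's offset."""
    blocks = np.array_split(np.asarray(x), p)
    scanned = [exclusive_scan(b) for b in blocks]     # launch 1: scan blocks
    totals = np.array([b.sum() for b in blocks])      # block totals
    offsets = exclusive_scan(totals)                  # launch 2: scan p values
    return np.concatenate([s + off                    # launch 3: vector add
                           for s, off in zip(scanned, offsets)])

x = np.arange(10)
print(reduce_scan_scan(x))  # [0 0 1 3 6 10 15 21 28 36]
print(scan_scan_add(x))     # same result
```

Both variants touch global memory a constant number of times per element, which is what makes them preferable to a recursive multi-phase scheme.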
Best of luck,
bryan

[1] Thrust: http://code.google.com/p/thrust

On Tue, Feb 22, 2011 at 1:47 AM, nithin s <[email protected]> wrote:
> Hi Andreas
>
> On 22 February 2011 05:45, Andreas Kloeckner <[email protected]> wrote:
>> Hi Nithin,
>>
>> two requests:
>>
>> - can you please resend this as an attachment? It's hard to fish out of
>>   the text of an email.
> Done
>>
>> - please avoid using floating point functions (log, ceil, floor) in
>>   integer contexts. PyCUDA comes with a bitlog2 function that does what
>>   you need, I think.
>
> bitlog2 alone doesn't cut it. This is because the log is taken to the
> base 2*block_size, and block_size need not be a power of 2 in a few rare
> cases: if shared memory is limited, then block_size =
> shared_mem/item_size, and the item size need not be a power of 2 (if we
> are willing to support arbitrary types, though there is a limitation,
> since dtype needs to be known for partial-sum array allocations, which
> is presumably numpy.dtype).
>
> This would mess up the estimate. I could recode this as a routine that
> repeatedly divides and calculates the necessary integer ceiling, but I
> feel the current expression is cleaner and more concise. Let me know if
> you still feel otherwise.
>
>> Once I get the file posted on the PyCUDA branch, I'll write a more
>> complete review. I agree with your assessment of inclusive vs. exclusive
>> scan. I'd say go ahead and kill the inclusive version.
>>
>> Tomasz, what's your take here?
>>
>> Andreas
>
> Regards
>
> Nithin
>
> @Bryan: Tomasz's original inclusive scan implementation was based on the
> naive prefix scan algorithm at http://en.wikipedia.org/wiki/Prefix_sum,
> which is not particularly work efficient. I don't (yet) see a neat way
> to convert the exclusive Mark Harris version to an inclusive one, so I
> thought it better to maintain a single exclusive prefix scan.
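For reference, the repeated-division routine Nithin mentions (an integer ceiling-log to an arbitrary base, avoiding float log/ceil) could look like the sketch below. The name and signature are hypothetical; this is not a PyCUDA function.

```python
def int_ceil_log(n, base):
    """Smallest k with base**k >= n, by repeated integer division.
    Avoids float log/ceil/floor in an integer context.
    (Hypothetical helper -- not part of PyCUDA.)"""
    assert n >= 1 and base >= 2
    k = 0
    while n > 1:
        n = (n + base - 1) // base  # integer ceiling division
        k += 1
    return k

print(int_ceil_log(1000, 10))  # 3
```

This works for any base >= 2, including the 2*block_size case where block_size is not a power of 2.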
> _______________________________________________
> PyCUDA mailing list
> [email protected]
> http://lists.tiker.net/listinfo/pycuda
