There has been a lot of work on parallel prefix sums on GPUs. The difference between inclusive and exclusive prefix sum implementations is minor, and both can be implemented efficiently.

Focusing on work efficiency for GPUs is a red herring, especially when you're using a fully parallel recursive prefix sum, as you are doing. This is for two reasons:

1. GPUs are SIMD machines, so work efficiency as counted by the PRAM model does not always correlate with saving work on the actual hardware.

2. The work involved in synchronizing and coordinating threads is not counted as "work" in traditional analyses. On GPUs, floating-point operations are essentially free; communication and coordination between parallel threads, however, are very expensive.

Instead of doing a fully parallel recursive prefix sum, it's much more efficient to sequentialize the sum as much as possible, expressing only enough parallelism to fill the chip.
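To make the sequentialization point concrete, here is a host-side NumPy sketch (not GPU code, and not part of PyCUDA): each of p "workers" sums a contiguous chunk of the input serially, so only p partial results ever need cross-worker coordination. The function name and parameter are illustrative only.

```python
import numpy as np

def two_phase_sum(x, p=8):
    """Sketch of a sequentialized sum: each of p 'workers' serially
    sums a contiguous chunk; only the p partial sums need combining."""
    chunks = np.array_split(np.asarray(x), p)
    partials = [c.sum() for c in chunks]  # serial work, fully independent
    return sum(partials)                  # tiny combine step over p values

print(two_phase_sum(np.arange(1000)))  # 499500
```

On a GPU, the per-chunk loop runs inside each thread (or thread block) with no synchronization at all; only the final combine over p values needs any coordination.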
It's also advantageous to take advantage of commutativity: have you decided whether your scan implementation accepts associative but non-commutative operators? If you require operators to be both associative and commutative, you can improve performance significantly.

I wrote up some things about reductions (which are related to scan) here:
http://developer.amd.com/documentation/articles/pages/opencl-optimization-case-study-simple-reductions.aspx

As you can see, the "2-phase" reduction kernels are much faster than the "recursive multi-phase" kernels; your implementation uses the recursive multi-phase approach. There are two analogues of the 2-phase reduction kernel for a scan on an n-element array with p processors (thread blocks), both of which require exactly 3 kernel launches, regardless of n or p:

* Reduce, Scan, Scan: Divide the input into blocks of n/p elements. Perform reductions on these blocks in parallel and write the sum of each block to a temporary array. Perform a scan on the temporary array (this should require only 1 thread block and one kernel launch, since there are only p elements in the temporary array). Perform a scan on the input, this time starting each block with the appropriate result from the temporary scan.

* Scan, Scan, Add: Scan the blocks in parallel and write the results out. Perform a scan on the first result from each block (this should require only 1 thread block and one kernel launch, since there are only p blocks). Broadcast the appropriate result from the partial scan and perform a vector addition in each block.

If you sequentialize these operations as much as possible, and build parallel reduction/scan trees only when absolutely necessary, you will save a lot of synchronization and communication cost. You can take a look at Thrust [1] to see an example of how this can be done.
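The two 3-launch decompositions above can be sketched on the CPU with NumPy; each commented "launch" corresponds to one kernel launch in the GPU version. This is only an illustration of the structure, not GPU code, and the function names are mine. (In the Scan-Scan-Add sketch I compute the block totals with a reduction for clarity; on the GPU they fall out of the per-block scans.)

```python
import numpy as np

def exclusive_scan(a):
    """Reference exclusive prefix sum of a 1-D array."""
    c = np.cumsum(a)
    return np.concatenate(([0], c[:-1]))

def reduce_scan_scan(x, p=4):
    """Reduce, Scan, Scan: per-block reductions, one small scan of the
    p block sums, then per-block scans seeded with those offsets."""
    blocks = np.array_split(np.asarray(x), p)
    block_sums = np.array([b.sum() for b in blocks])  # launch 1: reduce
    offsets = exclusive_scan(block_sums)              # launch 2: scan p values
    return np.concatenate([off + exclusive_scan(b)    # launch 3: seeded scans
                           for off, b in zip(offsets, blocks)])

def scan_scan_add(x, p=4):
    """Scan, Scan, Add: per-block scans, one small scan of the block
    totals, then a broadcast add of each block's offset."""
    blocks = np.array_split(np.asarray(x), p)
    scanned = [exclusive_scan(b) for b in blocks]     # launch 1: scan blocks
    totals = np.array([b.sum() for b in blocks])      # block totals
    offsets = exclusive_scan(totals)                  # launch 2: scan p values
    return np.concatenate([s + off                    # launch 3: vector add
                           for s, off in zip(scanned, offsets)])

x = np.arange(10)
print(reduce_scan_scan(x))  # [0 0 1 3 6 10 15 21 28 36]
print(scan_scan_add(x))     # same result
```

Both variants touch global memory a constant number of times per element, which is what makes them preferable to a recursive multi-phase scheme.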
Best of luck,
bryan

[1] Thrust: http://code.google.com/p/thrust

On Tue, Feb 22, 2011 at 1:47 AM, nithin s <[email protected]> wrote:
> Hi Andreas
>
> On 22 February 2011 05:45, Andreas Kloeckner <[email protected]> wrote:
>> Hi Nithin,
>>
>> two requests:
>>
>> - can you please resend this as an attachment? It's hard to fish out of
>>   the text of an email.
> Done
>>
>> - please avoid using floating point functions (log, ceil, floor) in
>>   integer contexts. PyCUDA comes with a bitlog2 function that does what
>>   you need, I think.
>
> bitlog2 alone doesn't cut it. This is because the log is taken to the
> base 2*block_size, and block_size need not be a power of 2 in a few rare
> cases: if shared memory is limited, then block_size =
> shared_mem/item_size, and the item size need not be a power of 2 (if we
> are willing to support arbitrary types, though there is a limitation,
> since dtype needs to be known for partial-sum array allocations, which
> is presumably numpy.dtype).
>
> This would mess up the estimate. I could recode this as a routine that
> repeatedly divides and calculates the necessary integer ceiling, but I
> feel the current expression is cleaner and more concise. Let me know if
> you still feel otherwise.
>
>> Once I get the file posted on the PyCUDA branch, I'll write a more
>> complete review. I agree with your assessment of inclusive vs. exclusive
>> scan. I'd say go ahead and kill the inclusive version.
>>
>> Tomasz, what's your take here?
>>
>> Andreas
>
> Regards
>
> Nithin
>
> @Bryan: Tomasz's original inclusive scan implementation was based on the
> naive prefix scan algorithm at http://en.wikipedia.org/wiki/Prefix_sum,
> which is not particularly work efficient. I don't (yet) see a neat way
> to convert the exclusive Mark Harris version to an inclusive one, so I
> thought it better to maintain a single exclusive prefix scan.
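For reference, the repeated-division routine Nithin mentions (an integer ceiling-log to an arbitrary base, avoiding float log/ceil) could look like the sketch below. The name and signature are hypothetical; this is not a PyCUDA function.

```python
def int_ceil_log(n, base):
    """Smallest k with base**k >= n, by repeated integer division.
    Avoids float log/ceil/floor in an integer context.
    (Hypothetical helper -- not part of PyCUDA.)"""
    assert n >= 1 and base >= 2
    k = 0
    while n > 1:
        n = (n + base - 1) // base  # integer ceiling division
        k += 1
    return k

print(int_ceil_log(1000, 10))  # 3
```

This works for any base >= 2, including the 2*block_size case where block_size is not a power of 2.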
> _______________________________________________
> PyCUDA mailing list
> [email protected]
> http://lists.tiker.net/listinfo/pycuda
