Hi Nithin,

On Mon, 21 Feb 2011 22:27:04 +0530, nithin s <[email protected]> wrote:
> I believe there are some errors in the implementation. I'm basing my
> comments only on the exclusive version.
> 
> The final call to finish adds each of the partial sums, one at a time,
> to every element of the result. That is to say, if my array size were
> 1024x1024 and each thread block worked on 1024 elements, my partial-sum
> array would have 1024 entries, and the last (or second-to-last) block
> would have to iterate over 1024 sums to produce the result.
> 
> Isn't this wrong? Shouldn't the partial sums be prefix-scanned first,
> so that each block then adds its associated partial-sum output to each
> of its elements? That way the loop for (int i = 1; i <= blockIdx.x; i++)
> is not needed.

We know it's broken at the moment--that's why it's currently living on a
branch and not in mainline PyCUDA yet. Patches welcome.
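
For reference, the approach suggested in the quoted message (scan the
per-block partial sums once, then have each block add its single offset)
can be sketched on the CPU in plain Python. This is only an illustration
of the idea, not the code on the PyCUDA branch: the function names
exclusive_scan and blocked_exclusive_scan and the block_size parameter
are made up for this example, and each list chunk stands in for the work
of one thread block.

```python
def exclusive_scan(values):
    """Exclusive prefix sum: out[i] == sum(values[:i])."""
    out = []
    running = 0
    for v in values:
        out.append(running)
        running += v
    return out


def blocked_exclusive_scan(data, block_size):
    """Three-phase scan; each chunk plays the role of one thread block."""
    # Phase 1: every "block" scans its own chunk and records its total.
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    local_scans = [exclusive_scan(chunk) for chunk in chunks]
    block_totals = [sum(chunk) for chunk in chunks]

    # Phase 2: scan the per-block totals once, giving each block the sum
    # of everything that precedes it.
    block_offsets = exclusive_scan(block_totals)

    # Phase 3: each block adds its single offset to its elements, so no
    # per-block loop over all earlier partial sums is needed.
    result = []
    for scan, offset in zip(local_scans, block_offsets):
        result.extend(x + offset for x in scan)
    return result
```

With this structure, the last block does one addition per element
instead of looping over every preceding partial sum.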

Andreas


_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
