Hi Nithin,

On Mon, 21 Feb 2011 22:27:04 +0530, nithin s <[email protected]> wrote:
> I believe there are some errors in the implementation. I'm
> basing my comments only on the exclusive version.
>
> The final call to finish adds each of the partial sums to
> every element of the result. That is to say, if my array size was
> 1024x1024 and each thread block worked on 1024 elements, my partial
> sum array would be as large as 1024, and the last (or second-to-last)
> block would have to iterate over 1024 sums to produce the result.
>
> Isn't this wrong? Shouldn't the partial sums be prefix scanned,
> with each block then adding its associated partial-sum output to each
> of its elements? That way the loop for (int i = 1; i <= blockIdx.x; i++)
> is not needed.

We know it's broken at the moment--that's why it's currently living on a
branch and not in mainline PyCUDA yet. Patches welcome.

Andreas
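
For reference, the three-step scheme suggested in the quoted message (scan each block, scan the per-block totals, add one offset per block) can be sketched on the host in plain Python. This is only an illustration of the idea under assumed names (exclusive_scan, blocked_exclusive_scan and block_size are hypothetical); it is not the code on the PyCUDA branch and does not use any PyCUDA API.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] == sum(values[:i])."""
    if not values:
        return []
    running = list(accumulate(values))  # inclusive prefix sums
    return [0] + running[:-1]           # shift right by one element


def blocked_exclusive_scan(values, block_size):
    """Exclusive scan computed block by block (hypothetical helper).

    Each list slice stands in for the chunk one thread block would handle.
    """
    blocks = [values[i:i + block_size]
              for i in range(0, len(values), block_size)]

    # Step 1: every block scans its own chunk and records its total.
    local_scans = [exclusive_scan(block) for block in blocks]
    block_sums = [sum(block) for block in blocks]

    # Step 2: scan the per-block totals once, giving one offset per block.
    offsets = exclusive_scan(block_sums)

    # Step 3: each block adds its single offset to its elements,
    # so no block has to loop over all preceding partial sums.
    return [x + offset
            for scanned, offset in zip(local_scans, offsets)
            for x in scanned]
```

With this layout the final pass does one addition per element, rather than a loop over blockIdx.x partial sums per element.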
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
