Hi,

I have seen the following idiom used a couple of times:

        const int tidx = blockIdx.x*blockDim.x + threadIdx.x;
        const int delta = blockDim.x*gridDim.x;

        curandState local_state = global_state[tidx];

        for (int idx = tidx; idx < n; idx += delta)
        {
            out[idx] = compute_sth(in[idx]);
        }

I'm not 100% sure I understand what's going on, but it appears to loop
over elements of the array spaced delta apart. However, I think that
when enough threads are available (n < max_threads), only one thread
would be doing all the work. Is that correct?
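
One way to check the index arithmetic without a GPU is to replay the loop on the host. The sketch below is plain C, not a kernel: tidx and delta are passed in as ordinary integers standing in for blockIdx.x*blockDim.x + threadIdx.x and blockDim.x*gridDim.x, and the helper names are made up for illustration.

```c
#include <assert.h>

/* Host-side stand-in for the kernel's loop: returns how many iterations
 * the thread with global index tidx executes when the stride is delta. */
static int iterations_for_thread(int tidx, int n, int delta)
{
    assert(delta > 0);
    int count = 0;
    for (int idx = tidx; idx < n; idx += delta)
        ++count;
    return count;
}

/* Number of simulated threads that execute at least one iteration.
 * block_dim and grid_dim stand in for blockDim.x and gridDim.x. */
static int busy_threads(int n, int block_dim, int grid_dim)
{
    int delta = block_dim * grid_dim;
    int busy = 0;
    for (int tidx = 0; tidx < delta; ++tidx)
        if (iterations_for_thread(tidx, n, delta) > 0)
            ++busy;
    return busy;
}
```

Under these assumptions, busy_threads(100, 64, 4) counts 100 busy threads out of 256: when n is smaller than the number of launched threads, each thread with tidx < n runs the loop body once and the remaining threads skip it, so the work is not concentrated in a single thread.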

Wouldn't a better idiom be something along the lines of:

        for (int idx = tidx; idx < n; idx += max_threads)

That way, if n < max_threads, each thread would loop only once, and
the loop would scale up seamlessly. Am I missing something?
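
To compare the two strides, the following host-side sketch in plain C counts how many of the n elements are written exactly once for a given number of launched threads and a given stride. It is only an illustration: num_threads stands in for blockDim.x*gridDim.x, stride for either delta or max_threads, and the function name is made up.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts how many of the n elements are written exactly once when
 * num_threads simulated threads each run
 *     for (idx = tidx; idx < n; idx += stride)
 * Returns -1 if the scratch buffer cannot be allocated. */
static int elements_covered_once(int n, int num_threads, int stride)
{
    assert(n >= 0 && num_threads > 0 && stride > 0);
    if (n == 0)
        return 0;

    int *hits = (int *)calloc((size_t)n, sizeof *hits);
    if (hits == NULL)
        return -1;

    /* Replay every thread's loop and record which elements it touches. */
    for (int tidx = 0; tidx < num_threads; ++tidx)
        for (int idx = tidx; idx < n; idx += stride)
            ++hits[idx];

    int once = 0;
    for (int i = 0; i < n; ++i)
        if (hits[i] == 1)
            ++once;

    free(hits);
    return once;
}
```

In this simulation, a stride equal to the number of launched threads covers every element exactly once, e.g. elements_covered_once(1000, 256, 256) gives 1000. A larger stride leaves gaps once n exceeds the thread count (256 of 1000 elements for a stride of 1024), and a smaller stride makes threads overlap. So if max_threads is taken to mean the number of threads actually launched, the two loops behave the same.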

Any advice would be appreciated.

Thomas

_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda