Hi,
I have seen the following idiom used a couple of times:
    const int tidx = blockIdx.x*blockDim.x + threadIdx.x;  // global thread index
    const int delta = blockDim.x*gridDim.x;                // total threads launched
    curandState local_state = global_state[tidx];
    for (int idx = tidx; idx < n; idx += delta)
    {
        out[idx] = compute_sth(in[idx]);
    }
I'm not sure I fully understand what's going on, but it appears to loop
over elements of the array spaced delta apart. However, I think that when
there are enough threads available (n < max_threads), only one thread
would be doing all the work. Is that correct?
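One way to check this reading is to simulate the loop on the host. The following is a minimal sketch in plain C++ (not a CUDA kernel); the function name elements_per_thread is hypothetical, and block_dim and grid_dim stand in for blockDim.x and gridDim.x. It only counts how many loop iterations each simulated thread would execute.

```cpp
#include <cassert>
#include <vector>

// Host-side simulation (plain C++, not a CUDA kernel) of the loop above.
// Returns, for every simulated thread, how many elements it would process.
// elements_per_thread is a hypothetical helper, not part of CUDA or PyCUDA.
std::vector<int> elements_per_thread(int n, int block_dim, int grid_dim)
{
    assert(block_dim > 0 && grid_dim > 0);
    const int delta = block_dim * grid_dim;  // total threads in the launch
    std::vector<int> count(delta, 0);
    for (int block = 0; block < grid_dim; ++block) {
        for (int thread = 0; thread < block_dim; ++thread) {
            // Same index arithmetic as in the kernel snippet.
            const int tidx = block * block_dim + thread;
            for (int idx = tidx; idx < n; idx += delta) {
                ++count[tidx];
            }
        }
    }
    return count;
}
```

Under these assumptions, with n = 5 and 2 blocks of 4 threads, threads 0 to 4 each handle one element and threads 5 to 7 handle none, so the work is not concentrated in a single thread. With n = 20 and the same launch, every thread handles two or three elements.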
Wouldn't a better idiom be something along the lines of:
    for (int idx = tidx; idx < n; idx += max_threads)
That way, if n < max_threads, each thread would loop only once, and it
would scale up seamlessly. Am I missing something?
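The effect of the stride can also be explored on the host. This is a minimal sketch in plain C++, assuming that max_threads is a fixed number that may differ from the number of threads actually launched; the helper visits_per_element and its parameters launched and stride are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: simulates `launched` threads, each running
//     for (idx = tidx; idx < n; idx += stride)
// and returns how many times each of the n elements is visited.
std::vector<int> visits_per_element(int n, int launched, int stride)
{
    assert(n >= 0 && launched > 0 && stride > 0);
    std::vector<int> visits(n, 0);
    for (int tidx = 0; tidx < launched; ++tidx) {
        for (int idx = tidx; idx < n; idx += stride) {
            ++visits[idx];
        }
    }
    return visits;
}
```

In this simulation, a stride equal to the number of launched threads visits every element exactly once. A stride larger than the launch (for example 8 threads with a stride of 16 over 20 elements) leaves elements 8 to 15 unvisited, and a smaller stride visits some elements more than once. So the two loops only behave the same when max_threads equals blockDim.x*gridDim.x.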
Any advice would be appreciated.
Thomas