Thomas Wiecki <[email protected]> writes:

> On Thu, Jun 7, 2012 at 11:50 AM, Andreas Kloeckner
> <[email protected]> wrote:
>
>> >> If you're asking about the maximal number of threads the device can
>> >> support (see above), there are good reasons to do smaller launches, as
>> >> long as they still fill the machine. (and PyCUDA makes sure of that)
>> >
>> > What are those good reasons?
>>
>> There's some (small) overhead for switching thread blocks compared to
>> just executing code within a block. So more blocks launched -> more of
>> that overhead. The point is that CUDA pretends that there's an
>> 'infinite' number of cores, and it's up to you to choose how many of
>> those to use. Because of the (very slight) penalty, it's best not to
>> stretch the illusion of 'infinitely many cores' too far if it's not
>> necessary. (In fact, much of the overhead is in address computations and
>> such, which can be amortized if there's just a single long for loop.)
>
> I see. In my case each item takes quite a while to compute, so taking the
> performance hit that comes with switching thread blocks is probably well
> worth it.
Measure, don't guess.

>> Check the code in pycuda.curandom for how it's used there. I'm certain
>> this uses grid_size > 1, otherwise most of the machine would go unused.
>
> I think this is the relevant call:
>
> p.prepared_call((self.block_count, 1), (self.generators_per_block, 1, 1),
>     self.state, self.block_count * self.generators_per_block,
>     seed.gpudata, offset)
>
> in XORWOWRandomNumberGenerator. So if I read that correctly, it inits
> blocks*threads generators, i.e. the maximum number available.
>
> It seems that calling a kernel on an array that is larger than
> threads_per_block*blocks is in general safe. The idx will just scale up
> so that the correct elements can be accessed, and the execution gets
> serialized over the maximum number of threads.
>
> However, if I supply generator.state and use more threads than there are
> generators, this serialization will not work, as the idx will try to
> access generators outside of what's defined. I think this is what caused
> my problems before.
>
> The solution, it seems, is to use the for-loop approach and then always
> call the kernel like this:
>
> my_kernel(generator.state, out,
>           block=(generator.generators_per_block, 1, 1),
>           grid=(generator.block_count, 1))
>
> That way I am sure I will never try to access uninitialized generators,
> and only use the for loop if I have to.
>
> Does that make sense?

Yes.

Andreas
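[Editor's note: the "single long for loop" Andreas mentions is the grid-stride
pattern. Below is a minimal pure-Python CPU simulation of that indexing; the
function name and the sizes are made up for illustration, and on the GPU `tid`
would be `blockIdx.x * blockDim.x + threadIdx.x`.]

```python
# CPU sketch of a grid-stride loop: a fixed pool of block_count *
# threads_per_block "threads" covers an array of any size by striding,
# so the per-block launch overhead is amortized over the inner for loop.
def grid_stride_map(data, block_count=4, threads_per_block=8):
    total_threads = block_count * threads_per_block
    out = [None] * len(data)
    for tid in range(total_threads):                    # one pass per "thread"
        for i in range(tid, len(data), total_threads):  # the stride loop
            out[i] = 2 * data[i]                        # stand-in per-element work
    return out

print(grid_stride_map(list(range(100)))[:5])  # -> [0, 2, 4, 6, 8]
```

Every element is touched exactly once, no matter how much larger the array is
than the launch.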
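[Editor's note: a sketch of why the launch configuration above is safe. With
exactly `generators_per_block * block_count` threads and a stride loop, every
element of a larger output maps to an initialized generator index. This is a
pure-Python simulation; the function name is hypothetical.]

```python
# Simulate which generator each output element would use under the safe
# launch: thread id doubles as the generator index, and the stride loop
# keeps every access within the n_generators that were initialized.
def generator_for_element(n_elements, generators_per_block, block_count):
    n_generators = generators_per_block * block_count
    mapping = [None] * n_elements
    for tid in range(n_generators):                      # tid == generator index
        for i in range(tid, n_elements, n_generators):   # grid-stride loop
            mapping[i] = tid
    return mapping

m = generator_for_element(1000, generators_per_block=64, block_count=4)
assert all(0 <= g < 64 * 4 for g in m)  # no uninitialized generator is touched
```

Launching more threads than generators would make `tid` exceed
`n_generators - 1`, which is exactly the out-of-bounds access described above.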
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
