On Mon, 14 Feb 2011 21:03:55 +0100, Tomasz Rybak <[email protected]> wrote:
> On Sun, 2011-02-13, at 19:12 -0500, Andreas Kloeckner wrote:
> > On Mon, 14 Feb 2011 00:51:13 +0100, Tomasz Rybak <[email protected]> wrote:
> > > After discussion with Martin Laprise I have come up with the following
> > > code (see attachment). It uses all available MPs, but I think it needs
> > > some code to decide whether to use the entire GPU (in case the generated
> > > vector is long) or only a few blocks (otherwise).
> > >
> > > I can fix the attached code to better suit PyCUDA style so you can push
> > > it to git, and only then try to add code managing the number of used
> > > blocks.
> >
> > Please work your changes into the branch I created. The changes there
> > concerned (much) more than style.
>
> I have noticed - I like your solution.
> BTW - you misspelled the names of the float2 and double2 CURAND functions;
> I have fixed them in the attached patch.
> Also, those functions (float2, double2) are available for the XORWOW
> generator, not for Sobol32 - unless I misunderstood the purpose of the
> variable has_box_muller.

Whoops, looks like you're right.

> > > > - The Sobol' direction vectors need to come from a very specific set to
> > > >   make sense, see curandGetDirectionVectors32 in the CURAND docs. We
> > > >   should probably call/wrap this function to get those vectors. Further,
> > > >   each generator should use a different vector, rather than the same
> > > >   one.
> > > >
> > > > - The Sobol' initialization needs to be worked out. In particular, I
> > > >   would like both generators to do something sensible if they're
> > > >   initialized without arguments.
> > >
> > > Agree on both points.
> >
> > Ok, sounds good.
>
> See attached patch.
>
> I have added a field self.block_count that equals the number of MPs;
> it is the number of blocks that are run for generating random numbers.
> Should I try to play with it and use fewer blocks for shorter sequences,
> or just leave it as is? I would prefer leaving it as is ;-) ; for
> smaller generated sequences kernels are executed quickly, so potential
> performance gains might not be worth the more sophisticated code.

I'd even go the opposite way and bump this to 2-3 times the number of
SMs. Optimizing for large arrays is fair, I think. If someone needs 15
random numbers, they're hardly going to come running to the GPU for
them. Done in the revised version of your patch.

> I have managed to use the maximum number of threads on Tesla - during
> initialisation I am just calling 2*blocks, each initialising
> only half of the generators that are used for one block. The test case
> worked on ION, and the sample program worked on Martin Laprise's machine,
> so I believe this is a good solution.

Sounds ok. On my Tesla, the double2 kernel runs into trouble for lack of
registers. I've thus bumped generators_per_block down by a factor of 2
from the maximum.

> After you pull the changes into git I will start working on seed_getters.

Pulled with a few changes as detailed above, still on the
curand-wrapper-v2-from-tomasz branch.

Andreas
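
For reference, the sizing discussed above can be sketched roughly as
follows. This is only an illustration, not the code in the branch: the
helper name launch_config and its parameters blocks_per_sm and
register_backoff are hypothetical, and the device figures in the usage
lines are made-up example values.

```python
def launch_config(sm_count, max_threads_per_block,
                  blocks_per_sm=3, register_backoff=2):
    """Return (block_count, generators_per_block) for a generator kernel.

    Hypothetical helper illustrating the two choices from this thread:
      * launch a small multiple (2-3x) of the SM count as blocks, since
        large arrays are the case worth optimizing for;
      * use fewer generators (threads) per block than the hardware
        maximum, so register-hungry kernels still launch.
    """
    if sm_count < 1 or max_threads_per_block < 1:
        raise ValueError("device limits must be positive")
    if blocks_per_sm < 1 or register_backoff < 1:
        raise ValueError("scaling factors must be positive")

    block_count = blocks_per_sm * sm_count
    # Integer division; never drop below one generator per block.
    generators_per_block = max(1, max_threads_per_block // register_backoff)
    return block_count, generators_per_block


# Example values only: a device with 30 SMs and 512 threads per block.
blocks, gens = launch_config(30, 512)
print(blocks, gens)
```

On a real device the two inputs could presumably be read from the
MULTIPROCESSOR_COUNT and MAX_THREADS_PER_BLOCK device attributes via
pycuda.driver; that lookup is left out here so the sketch runs without
a GPU.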
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
