On Mon, 14 Feb 2011 21:03:55 +0100, Tomasz Rybak <[email protected]> wrote:
> Dnia 2011-02-13, nie o godzinie 19:12 -0500, Andreas Kloeckner pisze:
> > On Mon, 14 Feb 2011 00:51:13 +0100, Tomasz Rybak <[email protected]> wrote:
> > > After discussion with Martin Laprise I have come up with the following
> > > code (see attachment). It uses all available MPs, but I think it needs
> > > some code to decide whether to use the entire GPU (in case the generated
> > > vector is long) or only a few blocks (otherwise).
> > > 
> > > I can fix the attached code to better suit the PyCUDA style so you can
> > > push it to git, and only then try to add code managing the number of
> > > blocks used.
> > 
> > Please work your changes into the branch I created. The changes there
> > concerned (much) more than style.
> 
> I have noticed - I like your solution.
> BTW - you misspelled the names of the float2 and double2 CURAND
> functions; I have fixed them in the attached patch.
> Also, those functions (float2, double2) are available for the XORWOW
> generator, not for Sobol32 - unless I have misunderstood the purpose of
> the variable has_box_muller.

Whoops, looks like you're right.
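As an aside, the has_box_muller distinction could be sketched roughly as
below. This is illustrative only, not the wrapper code from the patch: the
helper name is made up, but the CURAND device function names are real - the
paired Box-Muller variants curand_normal2/curand_normal2_double exist for
the XORWOW state, while the Sobol32 state only has the scalar
curand_normal/curand_normal_double.

```python
# Sketch (hypothetical helper, not the actual patch): which generators
# offer the paired Box-Muller normal variants in the CURAND device API.
HAS_BOX_MULLER = {
    "XORWOW": True,    # curand_normal2 / curand_normal2_double available
    "Sobol32": False,  # only scalar curand_normal / curand_normal_double
}

def normal_func_name(generator, double_precision):
    """Pick the CURAND device function name for normal variates."""
    if HAS_BOX_MULLER[generator]:
        return ("curand_normal2_double" if double_precision
                else "curand_normal2")
    return ("curand_normal_double" if double_precision
            else "curand_normal")

print(normal_func_name("XORWOW", True))    # curand_normal2_double
print(normal_func_name("Sobol32", False))  # curand_normal
```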

> > > > - The Sobol' direction vectors need to come from a very specific set to
> > > >   make sense, see curandGetDirectionVectors32 in the CURAND docs. We
> > > >   should probably call/wrap this function to get those vectors. Further,
> > > >   each generator should use a different vector, rather than the same
> > > >   one.
> > > > 
> > > > - The Sobol' initialization needs to be worked out. In particular, I
> > > >   would like both generators to do something sensible if they're
> > > >   initialized without arguments.
> > > 
> > > Agree on both points. 
> > 
> > Ok, sounds good.
> 
> See attached patch.
> 
> I have added a field self.block_count that equals the number of MPs;
> it is the number of blocks launched when generating random numbers.
> Should I try to play with it and use fewer blocks for shorter sequences,
> or just leave it as is? I would prefer leaving it as is ;-) - for
> smaller generated sequences the kernels execute quickly, so the potential
> performance gains may not be worth the extra complexity.

I'd even go the opposite way and bump this to 2-3 times the number of
SMs. Optimizing for large arrays is fair, I think. If someone needs 15
random numbers, they're hardly going to come running to the GPU for
them.

Done in the revised version of your patch.
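For illustration, the block-count choice might look like the sketch below.
The helper name and the default factor of 2 are assumptions, not the actual
patch; on a real device the SM count would come from
pycuda.driver.Device.get_attribute with the MULTIPROCESSOR_COUNT attribute.

```python
def choose_block_count(sm_count, oversubscription=2):
    """Blocks to launch for the generation kernels: a small multiple of
    the multiprocessor count, following the 2-3x suggestion above."""
    return sm_count * oversubscription

# e.g. on a device with 30 SMs:
print(choose_block_count(30))     # 60
print(choose_block_count(30, 3))  # 90
```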

> I have managed to use the maximum number of threads on the Tesla -
> during initialisation I simply launch 2*blocks, each initialising
> only half of the generators used by one generation block. The test case
> worked on ION, and the sample program worked on Martin Laprise's
> machine, so I believe this is a good solution.

Sounds ok. On my Tesla, the double2 kernel runs into trouble for lack of
registers. I've thus bumped generators_per_block down by a factor of 2
from the maximum.
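Taken together, the launch-configuration logic discussed here could be
sketched as follows. Again this is a rough illustration under assumptions
(the function names and the 512-thread figure are made up for the example),
not the code in the branch.

```python
def generators_per_block(max_threads_per_block, has_double2_kernel):
    """Generators handled by one generation block: the device maximum,
    halved when the register-hungry double2 kernel is present."""
    gpb = max_threads_per_block
    if has_double2_kernel:
        gpb //= 2  # double2 kernel runs out of registers at the maximum
    return gpb

def init_launch_config(block_count, gpb):
    """Initialization launches twice as many blocks, each setting up
    half of the generators used by one generation block."""
    return 2 * block_count, gpb // 2

gpb = generators_per_block(512, has_double2_kernel=True)
print(gpb)                          # 256
print(init_launch_config(60, gpb))  # (120, 128)
```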

> After you pull the changes into git I will start working on seed_getters.

Pulled with a few changes as detailed above, still on
curand-wrapper-v2-from-tomasz branch.

Andreas


_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
