On Mon, 2011-02-14 at 22:58 -0500, Andreas Kloeckner wrote:
> On Mon, 14 Feb 2011 21:03:55 +0100, Tomasz Rybak <[email protected]> wrote:
> > On Sun, 2011-02-13 at 19:12 -0500, Andreas Kloeckner wrote:
> > > On Mon, 14 Feb 2011 00:51:13 +0100, Tomasz Rybak <[email protected]> wrote:
> > > > After discussion with Martin Laprise I have come up with the following
> > > > code (see attachment). It uses all available MPs, but I think it needs
> > > > some logic to decide whether to use the entire GPU (when the generated
> > > > vector is long) or only a few blocks (otherwise).
> > > > 
> > > > I can clean up the attached code to better suit PyCUDA style so you can
> > > > push it to git, and only then try to add code managing the number of
> > > > used blocks.
> > > 
> > > Please work your changes into the branch I created. The changes there
> > > concerned (much) more than style.
> > 
> > I have noticed - I like your solution.
> > BTW - you misspelled the names of the float2 and double2 CURAND functions;
> > I have fixed them in the attached patch.
> > Also, those functions (float2, double2) are available for the XORWOW
> > generator, not for Sobol32 - unless I misunderstood the purpose of the
> > variable has_box_muller.
> 
> Whoops, looks like you're right.
> 
> > > > > - The Sobol' direction vectors need to come from a very specific set 
> > > > > to
> > > > >   make sense, see curandGetDirectionVectors32 in the CURAND docs. We
> > > > >   should probably call/wrap this function to get those vectors. 
> > > > > Further,
> > > > >   each generator should use a different vector, rather than the same
> > > > >   one.
> > > > > 
> > > > > - The Sobol' initialization needs to be worked out. In particular, I
> > > > >   would like both generators to do something sensible if they're
> > > > >   initialized without arguments.
> > > > 
> > > > Agree on both points. 
> > > 
> > > Ok, sounds good.
> > 
> > See attached patch.
> > 
> > I have added a field self.block_count equal to the number of MPs;
> > it is the number of blocks launched when generating random numbers.
> > Should I try to play with it and use fewer blocks for shorter sequences,
> > or just leave it as is? I would prefer leaving it as is ;-) ; for
> > smaller generated sequences the kernels execute quickly, so the potential
> > performance gains may not be worth the more sophisticated code.
> 
> I'd even go the opposite way and bump this to 2-3 times the number of
> SMs. Optimizing for large arrays is fair, I think. If someone needs 15
> random numbers, they're hardly going to come running to the GPU for
> them.
> 
> Done in the revised version of your patch.

I disagree here. IMO it makes no sense to use more blocks than there
are SMs, as it introduces the overhead of switching blocks. With my code
there is no switching between blocks: an SM gets a block to execute,
runs the kernel generating random numbers, and finishes. After your change
an SM gets a block, executes it, gets another block, ..., and finishes.

Each thread already generates multiple random numbers in its loop.
After your change it just loops fewer times than in my code.
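To make the trade-off concrete, here is a small, purely illustrative sketch (the function name and the device figures are mine, not from the patch) of how the per-thread loop count shrinks as the block count grows:

```python
def iterations_per_thread(n, block_count, threads_per_block):
    # Each thread strides through the output array, so launching more
    # blocks only means each thread loops fewer times over the same work.
    threads = block_count * threads_per_block
    return (n + threads - 1) // threads  # ceiling division

# Hypothetical figures: 100M floats, a GPU with 8 SMs, 256 threads/block.
print(iterations_per_thread(100_000_000, 1 * 8, 256))  # 1*SMs blocks -> 48829
print(iterations_per_thread(100_000_000, 3 * 8, 256))  # 3*SMs blocks -> 16277
```

Either way the total work is identical; the only difference is how many times each thread iterates versus how many blocks the scheduler has to rotate through.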

Time to generate 100 000 000 floats on a GF104:
using 3*SMs: 0.0315589904785 s
using 1*SMs: 0.0291240215302 s
These times are repeatable - for 3x I consistently get 0.031 s, for 1x 0.029 s.
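For reference, the relative slowdown implied by those two timings works out to roughly 8%:

```python
# Relative overhead of the 3*SMs launch versus the 1*SMs launch,
# computed from the two timings quoted above.
t3 = 0.0315589904785  # seconds, 3*SMs
t1 = 0.0291240215302  # seconds, 1*SMs
overhead = (t3 - t1) / t1
print(f"{overhead:.1%}")  # about 8.4% slower with 3x blocks
```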

So please revert to the previous state (just apply the attached patch).

> 
> > I have managed to use the maximum number of threads on Tesla - during
> > initialisation I just launch 2*blocks, each initialising
> > only half of the generators used by one block. The test case
> > worked on an ION, and the sample program worked on Martin Laprise's
> > machine, so I believe this is a good solution.
> 
> Sounds ok. On my Tesla, the double2 kernel runs into trouble for lack of
> registers. I've thus bumped generators_per_block down by a factor
> of 2 from the maximum.

OK, but do not punish Fermi for the limitations of Tesla; I added a test
and use half the threads only on Tesla. Fermi should still use the maximum
number of threads.
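The compute-capability check from the patch below can be sketched in isolation as pure Python (the attribute values here are illustrative, not queried from a real device):

```python
def generators_per_block(max_threads_per_block, max_block_dim_x,
                         compute_capability):
    # On GT200 (compute capability < 2.0) the double2 normal kernel
    # runs into register limits, so use only half the threads there;
    # Fermi keeps the full maximum.
    block_size = max_threads_per_block
    if compute_capability < (2, 0):
        block_size //= 2
    n = min(block_size, max_block_dim_x)
    assert n % 2 == 0  # generators_per_block is halved during initialisation
    return n

# Illustrative values: GT200 reports 512 threads/block, Fermi 1024.
print(generators_per_block(512, 512, (1, 3)))    # Tesla/GT200 -> 256
print(generators_per_block(1024, 1024, (2, 0)))  # Fermi -> 1024
```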

> 
> > After you pull changes into git I will start working on seed_getters.
> 
> Pulled with a few changes as detailed above, still on
> curand-wrapper-v2-from-tomasz branch.
> 
> Andreas
> 


-- 
Tomasz Rybak <[email protected]> GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A  488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak
diff --git a/pycuda/curandom.py b/pycuda/curandom.py
index 8da2a35..324e5ce 100644
--- a/pycuda/curandom.py
+++ b/pycuda/curandom.py
@@ -311,10 +311,12 @@ class _RandomNumberGeneratorBase(object):
 
         dev = drv.Context.get_device()
 
+        block_size = dev.get_attribute(
+            drv.device_attribute.MAX_THREADS_PER_BLOCK)
         # On GT200, double2 normal kernel runs into register limits.
         # Try to stay clear of those by using half the number possible.
-        block_size = dev.get_attribute(
-            drv.device_attribute.MAX_THREADS_PER_BLOCK) // 2
+        if dev.compute_capability() < (2, 0):
+            block_size = block_size // 2
         block_dimension =  dev.get_attribute(
             drv.device_attribute.MAX_BLOCK_DIM_X)
         self.generators_per_block = min(block_size, block_dimension)
@@ -322,7 +324,7 @@ class _RandomNumberGeneratorBase(object):
         # generators_per_block is divided by 2 below
         assert self.generators_per_block % 2 == 0
 
-        self.block_count = 3*dev.get_attribute(
+        self.block_count = dev.get_attribute(
             pycuda.driver.device_attribute.MULTIPROCESSOR_COUNT)
 
         from pycuda.characterize import sizeof


_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
