Re: [PyCUDA] CURAND wrappers

Tomasz Rybak Tue, 18 Jan 2011 11:22:21 -0800

Dnia 2011-01-17, pon o godzinie 10:59 -0500, Martin Laprise pisze:
> ok ... problem solved:
> 
> git apply --ignore-whitespace curand.diff
> 
> On Fri, Jan 14, 2011 at 3:48 PM, Martin Laprise <[email protected]>wrote:
> 
> > HI all,
> >
> > Thanks you for this very useful wrappers. I'm looking forward to try it.
> > Unfortunately, when I try to apply the patch, I've got the following error:
> >
> > git apply curand.diff
> > curand.diff:9: trailing whitespace.
> > .. warning::
> > curand.diff:10: trailing whitespace.
> >
> > curand.diff:11: trailing whitespace.
> >     The following classes are using random number generators that run on
> > GPU.
> > curand.diff:12: trailing whitespace.
> >     Each thread uses own generator. Creation of those generators requires
> > more
> > curand.diff:13: trailing whitespace.
> >     resources than subsequent generation of random numbers. After
> > experiments
> > error: patch failed: doc/source/array.rst:345
> > error: doc/source/array.rst: patch does not apply
> > error: patch failed: pycuda/curandom.py:233
> > error: pycuda/curandom.py: patch does not apply
> >
> > On which version of the trunk should I apply the patch ?
> >
> > Martin
> >


Sorry for troubles with patch - I use "patch -p1 <curand.diff"
instead of "git apply", and that's why I have not noticed
trailing whitespaces.
I am sending fixed patch now; it applies without any warning.

-- 
Tomasz Rybak <[email protected]> GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A  488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak

diff --git a/doc/source/array.rst b/doc/source/array.rst
index 5273df9..7acabba 100644
--- a/doc/source/array.rst
+++ b/doc/source/array.rst
@@ -345,6 +345,157 @@ Generating Arrays of Random Numbers
     Return an array of `shape` filled with random values of `dtype`
     in the range [0,1).
 
+.. warning::
+
+    The following classes are using random number generators that run on GPU.
+    Each thread uses own generator. Creation of those generators requires more
+    resources than subsequent generation of random numbers. After experiments
+    it looks like maximum number of active generators on Tesla devices
+    (with compute capabilities 1.x) is 256. Fermi devices allow for creating
+    1024 generators without any problems. If there are troubles with creating
+    objects of class PseudoRandomNumberGenerator or QuasiRandomNumberGenerator
+    decrease number of created generators
+    (and therefore number of active threads).
+
+.. class:: PseudoRandomNumberGenerator(offset, seed=None)
+
+    Class surrounding CURAND kernels from CUDA 3.2.
+    It allows for generating pseudo-random numbers with uniform and normal
+    probability function of type int, float, double, float2, double2.
+    Uses curandState to store random generator state and generates
+    sequences with period at least 2^190.
+
+    CUDA 3.2 and above.
+
+    .. method fill_uniform_int(data, input_size, stream=None)
+
+        Fills in array of input_size integer values pointed by data
+	with pseudo-random integers with uniform distribution.
+
+    .. method fill_uniform_float(data, input_size, stream=None)
+
+        Fills in array of input_size float values pointed by data
+	with pseudo-random floats with uniform distribution.
+
+    .. method fill_uniform_double(data, input_size, stream=None)
+
+        Fills in array of input_size double values pointed by data
+	with pseudo-random floats with uniform distribution.
+
+    .. method fill_normal_float(data, input_size, stream=None)
+
+        Fills in array of input_size float values pointed by data
+	with pseudo-random floats with normal distribution.
+
+    .. method fill_normal_double(data, input_size, stream=None)
+
+        Fills in array of input_size double values pointed by data
+	with pseudo-random double with normal distribution.
+
+    .. method fill_normal_float2(data, input_size, stream=None)
+
+        Fills in array of input_size float pairs pointed by data
+	with pseudo-random floats with normal distribution.
+
+    .. method fill_normal_double2(data, input_size, stream=None)
+
+        Fills in array of input_size double pairs pointed by data
+	with pseudo-random double with normal distribution.
+
+    .. method fill_uniform(data, stream=None)
+
+        Fills in GPUArray with pseudo-random values with uniform distribution.
+	Detects type of items in GPUArray and calls appropriate method.
+
+    .. method fill_normal(data, stream=None)
+
+        Fills in GPUArray with pseudo-random values with normal distribution.
+	Detects type of items in GPUArray and calls appropriate method.
+
+    .. method call_skip_ahead(i, stream=None)
+
+        Forces all generators to skip i values. Is equivalent to generating
+	i values and discarding results, but is much faster.
+
+    .. method call_skip_ahead_array(i, stream=None)
+
+        Accepts array i of integer values, telling each generator how many
+	values to skip.
+
+    .. method __call__(shape, dtype=numpy.float32, stream=None)
+
+        Creates GPUArray, fills it with pseudo-random values and returns it.
+
+.. class:: QuasiRandomNumberGenerator(vector, offset)
+
+    Class surrounding CURAND kernels from CUDA 3.2.
+    It allows for generating quasi-random numbers with uniform and normal
+    probability function of type int, float, double, float2, double2.
+    Uses curandStateSobol32 to store random generator state and generates
+    sequences with period of 2^32.
+
+    CUDA 3.2 and above.
+
+    .. method fill_uniform_int(data, input_size, stream=None)
+
+        Fills in array of input_size integer values pointed by data
+	with quasi-random integers with uniform distribution.
+
+    .. method fill_uniform_float(data, input_size, stream=None)
+
+        Fills in array of input_size float values pointed by data
+	with quasi-random floats with uniform distribution.
+
+    .. method fill_uniform_double(data, input_size, stream=None)
+
+        Fills in array of input_size double values pointed by data
+	with quasi-random floats with uniform distribution.
+
+    .. method fill_normal_float(data, input_size, stream=None)
+
+        Fills in array of input_size float values pointed by data
+	with quasi-random floats with normal distribution.
+
+    .. method fill_normal_double(data, input_size, stream=None)
+
+        Fills in array of input_size double values pointed by data
+	with quasi-random double with normal distribution.
+
+    .. method fill_normal_float2(data, input_size, stream=None)
+
+        Fills in array of input_size float pairs pointed by data
+	with quasi-random floats with normal distribution.
+
+    .. method fill_normal_double2(data, input_size, stream=None)
+
+        Fills in array of input_size double pairs pointed by data
+	with quasi-random double with normal distribution.
+
+    .. method fill_uniform(data, stream=None)
+
+        Fills in GPUArray with quasi-random values with uniform distribution.
+	Detects type of items in GPUArray and calls appropriate method.
+
+    .. method fill_normal(data, stream=None)
+
+        Fills in GPUArray with quasi-random values with normal distribution.
+	Detects type of items in GPUArray and calls appropriate method.
+
+    .. method call_skip_ahead(i, stream=None)
+
+        Forces all generators to skip i values. Is equivalent to generating
+	i values and discarding results, but is much faster.
+
+    .. method call_skip_ahead_array(i, stream=None)
+
+        Accepts array i of integer values, telling each generator how many
+	values to skip.
+
+    .. method __call__(shape, dtype=numpy.float32, stream=None)
+
+        Creates GPUArray, fills it with quasi-random values and returns it.
+
+
 Single-pass Custom Expression Evaluation
 ----------------------------------------
 
diff --git a/pycuda/curandom.py b/pycuda/curandom.py
index 2105f40..14d704e 100644
--- a/pycuda/curandom.py
+++ b/pycuda/curandom.py
@@ -233,6 +233,274 @@ def rand(shape, dtype=numpy.float32, stream=None):
     return result
 
 
+# Need to allocate memory for curandState - there in not enough memory
+# (neither shared nor private) on the chip for one for each thread ;-)
+# sizeof(curandState))
+curand_state_size = 40
+# sizeof(curandStateSobol32))
+curand_state_sobol_size = 136
+
+random_source = """
+// Uses C++ features (templates); do not surround with extern C
+#include <curand_kernel.h>
+
+extern "C" {
+__global__ void uniform_int(%(data_type)s *s, int *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand(&s[tidx]);
+    }
+}
+
+__global__ void uniform_float(%(data_type)s *s, float *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_uniform(&s[tidx]);
+    }
+}
+
+__global__ void uniform_double(%(data_type)s *s, double *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_uniform_double(&s[tidx]);
+    }
+}
+
+__global__ void normal_float(%(data_type)s *s, float *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_normal(&s[tidx]);
+    }
+}
+
+__global__ void normal_double(%(data_type)s *s, double *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_normal_double(&s[tidx]);
+    }
+}
+
+
+__global__ void skip_ahead(%(data_type)s *s, const int n, const int skip) {
+    const int idx = threadIdx.x;
+    if (idx < n) {
+        skipahead(skip, &s[idx]);
+    }
+}
+
+__global__ void skip_ahead_array(%(data_type)s *s, const int n, const int *skip) {
+    const int idx = threadIdx.x;
+    if (idx < n) {
+        skipahead(skip[idx], &s[idx]);
+    }
+}
+}
+"""
+
+class RandomNumberGenerator(object):
+    """
+    Class surrounding CURAND kernels from CUDA 3.2.
+    It allows for generating random numbers with uniform
+    and normal probability function of various types.
+    """
+    def __init__(self, data_type, data_type_size, additional_source):
+        import pycuda.compiler
+        import pycuda.driver
+        if pycuda.driver.get_version() < (3, 2, 0):
+            raise EnvironmentError("Need at least CUDA 3.2")
+# Max 256 threads on ION during preparing
+# Need to limit number of threads. If not there is failed execution:
+# pycuda._driver.LaunchError: cuLaunchGrid failed: launch out of resources
+        dev = pycuda.driver.Context.get_device()
+        if dev.compute_capability() >= (2, 0):
+            block_size = dev.get_attribute(pycuda.driver.device_attribute.MAX_THREADS_PER_BLOCK)
+            block_dimension =  dev.get_attribute(pycuda.driver.device_attribute.MAX_BLOCK_DIM_X)
+            self.generator_count = min(block_size, block_dimension)
+	else:
+            self.generator_count = 256
+# TODO: cudaThreadSetLimit(cudaLimitStackSize, 16k) on Fermi
+        self.state = pycuda.driver.mem_alloc(self.generator_count *
+            data_type_size)
+        source = str(random_source + additional_source) % {
+            'data_type': data_type,
+            }
+        self.module = pycuda.compiler.SourceModule(source, no_extern_c=True,
+            keep=True)
+        self.uniform_int = self.module.get_function("uniform_int")
+        self.uniform_int.prepare("PPi", block=(self.generator_count, 1, 1))
+        self.uniform_float = self.module.get_function("uniform_float")
+        self.uniform_float.prepare("PPi", block=(self.generator_count, 1, 1))
+        self.uniform_double = self.module.get_function("uniform_double")
+        self.uniform_double.prepare("PPi", block=(self.generator_count, 1, 1))
+        self.normal_float = self.module.get_function("normal_float")
+        self.normal_float.prepare("PPi", block=(self.generator_count, 1, 1))
+        self.normal_double = self.module.get_function("normal_double")
+        self.normal_double.prepare("PPi", block=(self.generator_count, 1, 1))
+        self.skip_ahead = self.module.get_function("skip_ahead")
+        self.skip_ahead.prepare("Pii", block=(self.generator_count, 1, 1))
+        self.skip_ahead_array = self.module.get_function("skip_ahead_array")
+        self.skip_ahead_array.prepare("PiP", block=(self.generator_count, 1, 1))
+    def __del__(self):
+        self.state.free()
+    def fill_uniform_int(self, data, input_size, stream = None):
+        self.uniform_int.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_uniform_float(self, data, input_size, stream = None):
+        self.uniform_float.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_uniform_double(self, data, input_size, stream = None):
+        self.uniform_double.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_normal_float(self, data, input_size, stream = None):
+        self.normal_float.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_normal_double(self, data, input_size, stream = None):
+        self.normal_double.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_uniform(self, data, stream = None):
+        if data.dtype == numpy.float32:
+            self.fill_uniform_float(data.gpudata, data.size, stream)
+        elif data.dtype == numpy.float64:
+            self.fill_uniform_double(data.gpudata, data.size, stream)
+        elif data.dtype in [numpy.int, numpy.int32, numpy.uint32]:
+            self.fill_uniform_int(data.gpudata, data.size, stream)
+        else:
+            raise NotImplementedError
+    def fill_normal(self, data, stream = None):
+        if isinstance(data.dtype, numpy.float32):
+            self.fill_normal_float(data.gpudata, data.size, stream)
+        elif isinstance(data.dtype, numpy.float64):
+            self.fill_normal_double(data.gpudata, data.size, stream)
+        else:
+            raise NotImplementedError
+    def call_skip_ahead(self, i, stream = None):
+        self.skip_ahead.prepared_async_call((1, 1), stream, self.state,
+            self.generator_count, i)
+    def call_skip_ahead_array(self, i, stream = None):
+        self.skip_ahead_array.prepared_async_call((1, 1), stream, self.state,
+            self.generator_count, i.gpudata)
+    def __call__(self, shape, dtype = numpy.float32, stream = None):
+        import pycuda.gpuarray
+        result = pycuda.gpuarray.GPUArray(shape, dtype)
+        self.fill_uniform(result, stream)
+        return result
+
+pseudo_random_source = """
+extern "C" {
+__global__ void prepare_with_seed(curandState *s, const int n, const int seed, const int offset) {
+    if (threadIdx.x < n) {
+        curand_init(seed, threadIdx.x, offset, &s[threadIdx.x]);
+    }
+}
+
+__global__ void prepare_with_seeds(curandState *s, const int n, const int *seed, const int offset) {
+    if (threadIdx.x < n) {
+        curand_init(seed[threadIdx.x], threadIdx.x, offset,
+            &s[threadIdx.x]);
+    }
+}
+
+__global__ void normal_float2(curandState *s, float2 *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_normal2(&s[tidx]);
+    }
+}
+
+__global__ void normal_double2(curandState *s, double2 *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_normal2_double(&s[tidx]);
+    }
+}
+}
+"""
+
+class PseudoRandomNumberGenerator(RandomNumberGenerator):
+    """
+    Class surrounding CURAND kernels from CUDA 3.2.
+    It allows for generating pseudo-random numbers with uniform
+    and normal probability function of type int, float, double,
+    float2, double2.
+    """
+    def __init__(self, offset, seed = None):
+        import pycuda._driver
+        import pycuda.gpuarray
+        super(PseudoRandomNumberGenerator, self).__init__('curandState',
+            curand_state_size, pseudo_random_source)
+        self.normal_float2 = self.module.get_function("normal_float2")
+        self.normal_float2.prepare("PPi", block=(self.generator_count, 1, 1))
+        self.normal_double2 = self.module.get_function("normal_double2")
+        self.normal_double2.prepare("PPi", block=(self.generator_count, 1, 1))
+# Initialise
+        if seed is None:
+            import random
+            import sys
+            seed = random.randint(0, ((1 << 31) - 1))
+        if isinstance(seed, int):
+            p = self.module.get_function("prepare_with_seed")
+            p.prepare("Piii", block=(self.generator_count, 1, 1))
+            try:
+                p.prepared_call((1, 1), self.state, self.generator_count, seed,
+                    offset)
+            except pycuda._driver.LaunchError:
+                raise ValueError("Initialisation failed. Decrease number of threads.")
+        else:
+            if isinstance(seed, list) or isinstance(seed, tuple):
+                seed = numpy.array(seed, dtype=numpy.int32)
+            if isinstance(seed, numpy.ndarray):
+                seed = pycuda.gpuarray.to_gpu(seed).astype(numpy.int32)
+            if (isinstance(seed, pycuda.gpuarray.GPUArray) and
+                seed.dtype == numpy.int32):
+                p = self.module.get_function("prepare_with_seeds")
+                p.prepare("PiPi", block=(self.generator_count, 1, 1))
+                try:
+                    p.prepared_call((1, 1), self.state, self.generator_count,
+                        seed.gpudata, offset)
+                except pycuda._driver.LaunchError:
+                    raise ValueError("Initialisation failed. Decrease number of threads.")
+            else:
+                raise ValueError("Need GPUArray of integers")
+    def fill_normal_float2(self, data, input_size, stream = None):
+        self.normal_float2.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_normal_double2(self, data, input_size, stream = None):
+        self.normal_double2.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+
+quasi_random_source = """
+// Uses C++ features (templates); do not surround with extern C
+#include <curand_kernel.h>
+
+extern "C" {
+__global__ void prepare(curandStateSobol32 *s, const int n, unsigned int *v,
+    const unsigned int o) {
+    if (threadIdx.x < n) {
+        curand_init(v, o, &s[threadIdx.x]);
+    }
+}
+}
+"""
+
+class QuasiRandomNumberGenerator(RandomNumberGenerator):
+    """
+    Class surrounding CURAND kernels from CUDA 3.2.
+    It allows for generating quasi-random numbers with uniform
+    and normal probability function of type int, float, and double.
+    """
+    def __init__(self, vector, offset):
+        import pycuda._driver
+        super(QuasiRandomNumberGenerator, self).__init__('curandStateSobol32',
+            curand_state_sobol_size, quasi_random_source)
+        p = self.module.get_function("prepare")
+        p.prepare("PiPi", block=(self.generator_count, 1, 1))
+        try:
+            p.prepared_call((1, 1), self.state, self.generator_count, vector,
+                offset)
+        except pycuda._driver.LaunchError:
+            raise ValueError("Initialisation failed. Decrease number of threads.")
+
+
 if __name__ == "__main__":
     import sys, pycuda.autoinit
 
@@ -255,3 +523,16 @@ if __name__ == "__main__":
         subplot(132); plot( r2.get(),"x-")
         subplot(133); plot( r3.get(),"x-")
         show()
+
+        import pycuda.gpuarray
+        rr = PseudoRandomNumberGenerator(0, numpy.random.random(256).astype(numpy.int32))
+        data = pycuda.gpuarray.zeros([10000], numpy.float32)
+        print data[0:200]
+        rr.fill_uniform_float(data.gpudata, 512)
+        print data[0:200]
+        data = rr((100, ), numpy.uint32)
+        del rr
+        print data[0:200]
+
+        data = pycuda.gpuarray.zeros([256], numpy.int)
+        rr = QuasiRandomNumberGenerator(data.gpudata, 1)

signature.asc
Description: This is a digitally signed message part

_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda

Re: [PyCUDA] CURAND wrappers

Reply via email to