On Sun, 2010-12-26 at 17:42 +0100, Andreas Kloeckner wrote:
> Hi Tomasz,
>
> On Tue, 21 Dec 2010 23:21:59 +0100, Tomasz Rybak <[email protected]> wrote:
> > I have written the attached code that calculates a parallel prefix sum.
> > There are two variants - the first one (exclusive) is based on the
> > article by Mark Harris from NVIDIA, the second one (inclusive) is based
> > on a diagram from Wikipedia.
> > The inclusive scan could be optimised with regard to shared memory
> > access conflicts, similarly to the exclusive version.
> > It also seems that the inclusive scan is less stable numerically -
> > results differ from the CPU version when calculating the scan of
> > large arrays.
>
> Again, thanks for your contribution! Here are a few comments on the
> scan:
>
> - It doesn't seem like the inclusive and the exclusive version are so
>   dissimilar. As such, I don't think we should duplicate code for the
>   two. If necessary, I'd even prefer to make this code depend on Mako or
>   Jinja (template engines) to avoid code duplication.
I have fixed the code repetition and added class inheritance. I am not
sure about joining the exclusive and inclusive scans - they use different
algorithms. For now I have added the functions and classes to
pycuda.reduction - feel free to move them to another module.

> - Use warnings.warn, not plain print, for warnings.

Done.

> - Tests should go into tests/test_gpu_array or some such.

Done; I have also added documentation (see patch).

> - Formal nitpicks: Please indent comments with the rest of the code.
>   PEP 8 says a=value (no spaces) for keyword arguments. Camel case in C
>   is yucky, too. :)

Done, sorry.

> - 'Sum' is poor wording for the general associative operation that the
>   scan uses--use scan_op perhaps.

Done.

--
Tomasz Rybak <[email protected]> GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak
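To make the "they use different algorithms" point concrete, here is a CPU-side Python sketch of mine (not part of the patch) of the two scans the kernels implement: the work-efficient up-sweep/down-sweep from the Harris whitepaper (exclusive) and the Hillis-Steele passes from the Wikipedia diagram (inclusive). Power-of-2 sizes are assumed, as in the main kernels:

```python
# CPU-side sketch of the two scan algorithms used in the patch.
# Both assume len(data) is a power of 2, like the main GPU kernels.

def exclusive_scan(data, op=lambda a, b: a + b, neutral=0):
    """Work-efficient (Blelloch) exclusive scan: up-sweep, then down-sweep."""
    a = list(data)
    n = len(a)
    # Up-sweep: build partial sums at the internal nodes of a binary tree
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            a[i + 2 * d - 1] = op(a[i + 2 * d - 1], a[i + d - 1])
        d *= 2
    # Down-sweep: clear the root, then push prefixes back down the tree
    a[n - 1] = neutral
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):
            t = a[i + d - 1]
            a[i + d - 1] = a[i + 2 * d - 1]
            a[i + 2 * d - 1] = op(a[i + 2 * d - 1], t)
        d //= 2
    return a

def inclusive_scan(data, op=lambda a, b: a + b):
    """Hillis-Steele inclusive scan: log2(n) passes, each pass adding
    the element `d` positions back."""
    a = list(data)
    n = len(a)
    d = 1
    while d < n:
        prev = list(a)  # snapshot emulates the __syncthreads() barrier
        for i in range(d, n):
            a[i] = op(prev[i], prev[i - d])
        d *= 2
    return a

data = [3, 1, 7, 0, 4, 1, 6, 3]
print(exclusive_scan(data))  # [0, 3, 4, 11, 11, 15, 16, 22]
print(inclusive_scan(data))  # [3, 4, 11, 11, 15, 16, 22, 25]
```

Note that Hillis-Steele performs O(n log n) applications of the operator versus O(n) for the up-sweep/down-sweep, which may be related to the numerical drift I see on large float arrays.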
diff --git a/doc/source/array.rst b/doc/source/array.rst
index 5273df9..4bd9c92 100644
--- a/doc/source/array.rst
+++ b/doc/source/array.rst
@@ -448,6 +448,83 @@ Here's a usage example::
my_dot_prod = krnl(a, b).get()
+Parallel Prefix Scan
+--------------------
+
+These functions and classes live in the module :mod:`pycuda.reduction`.
+
+.. function:: get_exclusive_prefix_module(data_type, scan_operation, neutral, name='prefix_kernel', optimise=False, keep=False, options=[], preamble='')
+
+   Generate a module containing a function named *name*. The function
+   accepts two pointers of type *data_type* and the array size *n*. The
+   first pointer points to the memory where the result will be stored,
+   the second points to the input array. The function assumes that the
+   input and output arrays have the same size and that this size is a
+   power of 2. It performs an exclusive parallel prefix scan, applying
+   *scan_operation* to all elements. If the scan operation is addition
+   (scan_operation='a+b'), the input array {a0, a1, a2, ..., aN} is
+   transformed into {0, a0, a0+a1, a0+a1+a2, ...}. *neutral* is the
+   value that is neutral for the particular *scan_operation*, e.g. 0 for
+   addition, 1 for multiplication. *optimise* determines whether to use
+   more shared memory to avoid bank conflicts.
+
+.. function:: get_exclusive_prefix_module_tail(data_type, scan_operation, neutral, name='prefix_kernel', optimise=False, keep=False, options=[], preamble='')
+
+   Like *get_exclusive_prefix_module*, but accepts arrays whose size is
+   not a power of 2.
+
+.. function:: get_inclusive_prefix_module(data_type, scan_operation, neutral, name='prefix_kernel', keep=False, options=[], preamble='')
+
+   Generate a module containing a function named *name*. The function
+   accepts two pointers of type *data_type* and the array size *n*. The
+   first pointer points to the memory where the result will be stored,
+   the second points to the input array. The function assumes that the
+   input and output arrays have the same size and that this size is a
+   power of 2. It performs an inclusive parallel prefix scan, applying
+   *scan_operation* to all elements. If the scan operation is addition
+   (scan_operation='a+b'), the input array {a0, a1, a2, ..., aN} is
+   transformed into {a0, a0+a1, a0+a1+a2, ..., a0+a1+a2+...+aN}.
+   *neutral* is the value that is neutral for the particular
+   *scan_operation*, e.g. 0 for addition, 1 for multiplication. The
+   returned function uses a different algorithm than
+   *get_exclusive_prefix_module*.
+
+.. function:: get_inclusive_prefix_module_tail(data_type, scan_operation, neutral, name='prefix_kernel', keep=False, options=[], preamble='')
+
+   Like *get_inclusive_prefix_module*, but accepts arrays whose size is
+   not a power of 2.
+
+.. class:: ExclusivePrefixKernel(data_type, scan_operation, neutral, name='prefix_kernel', keep=False, options=[], preamble='')
+
+   Generates kernels that perform an exclusive prefix scan on a
+   :class:`GPUArray`. The generated kernels apply *scan_operation* to
+   arrays of *data_type* elements, with *neutral* as the neutral element.
+
+   .. method:: __call__(result, material[, size])
+
+      Invoke the generated kernels. *result* and *material* must be
+      instances of :class:`GPUArray`. *size* is optional; if it is
+      omitted, the operation is performed on the entire array, otherwise
+      only on the first *size* elements.
+
+.. class:: InclusivePrefixKernel(data_type, scan_operation, neutral, name='prefix_kernel', keep=False, options=[], preamble='')
+
+   Generates kernels that perform an inclusive prefix scan on a
+   :class:`GPUArray`. The generated kernels apply *scan_operation* to
+   arrays of *data_type* elements, with *neutral* as the neutral element.
+
+   .. method:: __call__(result, material[, size])
+
+      Invoke the generated kernels. *result* and *material* must be
+      instances of :class:`GPUArray`. *size* is optional; if it is
+      omitted, the operation is performed on the entire array, otherwise
+      only on the first *size* elements.
+
+
+Here's a usage example::
+
+ a = gpuarray.arange(400, dtype=numpy.float32)
+ b = gpuarray.arange(400, dtype=numpy.float32)
+
+ krnl = InclusivePrefixKernel('float', 'a+b', '0')
+
+ krnl(a, b)
+
Fast Fourier Transforms
-----------------------
diff --git a/pycuda/reduction.py b/pycuda/reduction.py
index 474997f..d349e48 100644
--- a/pycuda/reduction.py
+++ b/pycuda/reduction.py
@@ -385,3 +385,444 @@ def get_subset_minmax_kernel(what, dtype):
"const %(tp)s *in" % {
"tp": dtype_to_ctype(dtype),
}, preamble="#define MY_INFINITY (1./0)")
+
+
+# Parallel prefix scan
+
+exclusive_scan_source = """
+ // Travel upwards, from leaves to the root of the tree
+ // During each loop time distance between nodes increases two times,
+ // and number of nodes decreases two times
+ for (int number_of_nodes = full_array_size>>1; number_of_nodes > 0;
+ number_of_nodes >>= 1) {
+ __syncthreads();
+
+ if (threadIdx.x < number_of_nodes) {
+ int source_node_index = tree_node_distance*(2*threadIdx.x+1)-1;
+ int target_node_index = tree_node_distance*(2*threadIdx.x+2)-1;
+ source_node_index += SHARED_BANK_PADDING(source_node_index);
+ target_node_index += SHARED_BANK_PADDING(target_node_index);
+
+ shared_array[target_node_index] =
+ SUM(shared_array[target_node_index],
+ shared_array[source_node_index]);
+ }
+ tree_node_distance <<= 1;
+ }
+
+ if (threadIdx.x == 0) {
+ shared_array[full_array_size-1 + SHARED_BANK_PADDING(full_array_size-1)] = %(neutral)s;
+ }
+
+ // Travel downwards, from root to the leaves
+ // During each loop number of nodes increases two times and
+ // distance between nodes decreases two times
+ for (int number_of_nodes = 1; number_of_nodes < full_array_size;
+ number_of_nodes <<= 1) {
+ tree_node_distance >>= 1;
+ __syncthreads();
+
+ if (threadIdx.x < number_of_nodes) {
+ int source_node_index = tree_node_distance*(2*threadIdx.x+1)-1;
+ int target_node_index = tree_node_distance*(2*threadIdx.x+2)-1;
+ source_node_index += SHARED_BANK_PADDING(source_node_index);
+ target_node_index += SHARED_BANK_PADDING(target_node_index);
+
+ %(data_type)s temp = shared_array[source_node_index];
+ shared_array[source_node_index] = shared_array[target_node_index];
+ shared_array[target_node_index] =
+ SUM(shared_array[target_node_index], temp);
+ }
+ }
+ __syncthreads();
+"""
+
+scan_main_head = """
+ #define SUM(a, b) (%(scan_operation)s)
+
+ %(preamble)s
+
+ __global__ void %(name)s(%(data_type)s *result,
+ const %(data_type)s *material, const int n) {
+ extern __shared__ %(data_type)s shared_array[];
+
+ int source_node_index = threadIdx.x;
+"""
+
+scan_tail_head = """
+ #define SUM(a, b) (%(scan_operation)s)
+
+ %(preamble)s
+
+ __global__ void %(name)s(%(data_type)s *result,
+ const %(data_type)s *material, const int n, const int block_offset) {
+ extern __shared__ %(data_type)s shared_array[];
+
+ int source_node_index = threadIdx.x;
+"""
+
+# Exclusive version, after the Mark Harris article
+def get_exclusive_prefix_module(data_type, scan_operation, neutral,
+        name='prefix_kernel', optimise=False, keep=False, options=[],
+        preamble=''):
+    """
+    Prefix-sum implementation based on the whitepaper
+    "Parallel Prefix Sum (Scan) with CUDA" by Mark Harris (NVIDIA).
+    """
+ from pycuda.compiler import SourceModule
+ if optimise:
+ preamble = preamble+"\n#define SHARED_BANK_PADDING(n) ((n) >> 4)\n"
+ else:
+ preamble = preamble+"\n#define SHARED_BANK_PADDING(n) 0\n"
+ src = (scan_main_head +
+ """
+ // Size of shared array that can contain input data
+ const int full_array_size = n;
+ const int block_offset = 2*blockIdx.x*blockDim.x;
+ int tree_node_distance = 1;
+
+ int target_node_index = threadIdx.x + (n/2);
+ shared_array[source_node_index + SHARED_BANK_PADDING(source_node_index)] =
+ material[source_node_index + block_offset];
+ shared_array[target_node_index + SHARED_BANK_PADDING(target_node_index)] =
+ material[target_node_index + block_offset];
+ """ +
+ exclusive_scan_source +
+ """
+ result[source_node_index + block_offset] =
+ shared_array[source_node_index +
+ SHARED_BANK_PADDING(source_node_index)];
+ result[target_node_index + block_offset] =
+ shared_array[target_node_index +
+ SHARED_BANK_PADDING(target_node_index)];
+ }
+ """) % {
+ 'data_type': data_type,
+ 'name': name,
+ 'neutral': neutral,
+ 'scan_operation': scan_operation,
+ 'preamble': preamble,
+ }
+ return SourceModule(src, options=options, keep=keep)
+
+def get_exclusive_prefix_module_tail(data_type, scan_operation, neutral,
+        name='prefix_kernel', optimise=False, keep=False, options=[],
+        preamble=''):
+    """
+    Prefix-sum implementation based on the whitepaper
+    "Parallel Prefix Sum (Scan) with CUDA" by Mark Harris (NVIDIA).
+
+    Can be used for arrays whose size is not a power of 2.
+    Expects the shared memory size to be larger than the input data size.
+    """
+ from pycuda.compiler import SourceModule
+ if optimise:
+ preamble = preamble+"\n#define SHARED_BANK_PADDING(n) ((n) >> 4)\n"
+ else:
+ preamble = preamble+"\n#define SHARED_BANK_PADDING(n) 0\n"
+ src = (scan_tail_head +
+ """
+ // Size of shared array that can contain input data
+ // The smallest array of size power-of-2 that can contain original array
+ const int full_array_size = 1<<((int)(floor(log2((float)n))+1));
+ int tree_node_distance = 1;
+
+ int target_node_index = threadIdx.x + (n/2);
+ if (source_node_index < n) {
+ shared_array[source_node_index + SHARED_BANK_PADDING(source_node_index)] = material[source_node_index + block_offset];
+ } else {
+ shared_array[source_node_index + SHARED_BANK_PADDING(source_node_index)] = %(neutral)s;
+ }
+ if (target_node_index < n) {
+ shared_array[target_node_index + SHARED_BANK_PADDING(target_node_index)] = material[target_node_index + block_offset];
+ } else {
+ shared_array[target_node_index + SHARED_BANK_PADDING(target_node_index)] = %(neutral)s;
+ }
+ """ +
+ exclusive_scan_source +
+ """
+ if (source_node_index < n) { result[source_node_index + block_offset] = shared_array[source_node_index + SHARED_BANK_PADDING(source_node_index)]; }
+ if (target_node_index < n) { result[target_node_index + block_offset] = shared_array[target_node_index + SHARED_BANK_PADDING(target_node_index)]; }
+ }
+ """) % {
+ 'data_type': data_type,
+ 'name': name,
+ 'neutral': neutral,
+ 'scan_operation': scan_operation,
+ 'preamble': preamble,
+ }
+ return SourceModule(src, options=options, keep=keep)
+
+# Inclusive version, after http://en.wikipedia.org/wiki/Prefix_sum
+
+inclusive_scan_source = """
+ // During each loop time distance between nodes increases two times,
+ // and number of nodes decreases two times
+ for (int tree_node_distance = 1; tree_node_distance < n; tree_node_distance <<= 1) {
+ // Ensure that all values are in an array
+ __syncthreads();
+
+ int target_node_index = source_node_index + tree_node_distance;
+ %(data_type)s partial;
+ if (target_node_index < n) {
+ partial = SUM(shared_array[target_node_index],
+ shared_array[source_node_index]);
+ }
+
+ // Ensure that all sums are in registers
+ __syncthreads();
+
+ // Write to an array
+ if (target_node_index < n) {
+ shared_array[target_node_index] = partial;
+ }
+ }
+
+ __syncthreads();
+"""
+
+# This function does not use "neutral", but it keeps the same argument
+# list as the other functions
+def get_inclusive_prefix_module(data_type, scan_operation, neutral,
+        name='prefix_kernel', keep=False, options=[], preamble=''):
+    """
+    Prefix-sum implementation based on the Wikipedia "Prefix sum" article.
+    """
+ from pycuda.compiler import SourceModule
+ src = (scan_main_head +
+ """
+ const int block_offset = blockIdx.x*blockDim.x;
+ shared_array[source_node_index] = material[source_node_index + block_offset];
+ """ +
+ inclusive_scan_source +
+ """
+ result[source_node_index + block_offset] = shared_array[source_node_index];
+ }
+ """) % {
+ 'data_type': data_type,
+ 'name': name,
+ 'scan_operation': scan_operation,
+ 'preamble': preamble,
+ }
+ return SourceModule(src, options=options, keep=keep)
+
+def get_inclusive_prefix_module_tail(data_type, scan_operation, neutral,
+        name='prefix_kernel', keep=False, options=[], preamble=''):
+    """
+    Prefix-sum implementation based on the Wikipedia "Prefix sum" article.
+
+    Can be used for arrays whose size is not a power of 2.
+    """
+ from pycuda.compiler import SourceModule
+ src = (scan_tail_head +
+ """
+ if (source_node_index < n) {
+ shared_array[source_node_index] = material[source_node_index + block_offset];
+ } else {
+ shared_array[source_node_index] = %(neutral)s;
+ }
+ """ +
+ inclusive_scan_source +
+ """
+ if (source_node_index < n) {
+ result[source_node_index + block_offset] = shared_array[source_node_index];
+ }
+ }
+ """) % {
+ 'data_type': data_type,
+ 'name': name,
+ 'scan_operation': scan_operation,
+ 'neutral': neutral,
+ 'preamble': preamble,
+ }
+ return SourceModule(src, options=options, keep=keep)
+
+class PrefixKernel(object):
+    def __init__(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+ # Determine the size of used data type
+ self.numpy_type = pycuda.tools.parse_c_arg("const %s * in" %
+ data_type).dtype
+ item_size = self.numpy_type.itemsize
+
+ # Determine the number of threads
+ dev = pycuda.driver.Context.get_device()
+ block_size = dev.get_attribute(pycuda.driver.device_attribute.MAX_THREADS_PER_BLOCK)
+ block_dimension = dev.get_attribute(pycuda.driver.device_attribute.MAX_BLOCK_DIM_X)
+ self.block_size = min(block_size, block_dimension)
+
+        # Shared memory size: number of array elements processed per
+        # block (see get_block_size) times the size of one data item
+        self.shared_size = self.get_block_size()*item_size
+ # Padding to avoid bank conflicts
+ log_num_banks = 4
+ self.shared_size += ((self.get_block_size() >> log_num_banks) *
+ item_size)
+
+ # Check whether we do not exceed available shared memory size
+ max_shared_size = dev.get_attribute(pycuda.driver.device_attribute.MAX_SHARED_MEMORY_PER_BLOCK)
+ if self.shared_size > max_shared_size:
+ import warnings
+            warnings.warn("Reducing block and shared memory size to fit the device limit")
+ self.shared_size = max_shared_size
+ self.block_size = self.shared_size//(2*item_size)
+
+ self.main_function = self.get_main_function(data_type, scan_operation,
+ neutral, name, keep, options, preamble)
+ self.main_function.prepare("PPi", block=(self.block_size, 1, 1),
+ shared=self.shared_size)
+ self.tail_function = self.get_tail_function(data_type, scan_operation,
+ neutral, name, keep, options, preamble)
+ self.tail_function.prepare("PPii", block=(self.block_size, 1, 1),
+ shared=self.shared_size)
+ # Use maximum available shared memory in 2.x devices
+ # TODO: is it needed as we are more limited by number of threads?
+ # Might be needed for large data types (?)
+ if dev.compute_capability() >= (2, 0):
+ cache_size = pycuda.driver.func_cache.PREFER_SHARED
+ self.main_function.set_cache_config(cache_size)
+ self.tail_function.set_cache_config(cache_size)
+
+    def get_block_size(self):
+        raise NotImplementedError
+
+    def get_main_function(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+        raise NotImplementedError
+
+    def get_tail_function(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+        raise NotImplementedError
+
+    def __call__(self, *args, **kwargs):
+        # Accept (result, material[, size]) positionally or as keywords
+        result = kwargs.get('result', args[0] if len(args) > 0 else None)
+        material = kwargs.get('material', args[1] if len(args) > 1 else None)
+        # Size of the data to operate on; defaults to the size of the
+        # input array
+        input_size = kwargs.get('size', args[2] if len(args) > 2 else None)
+        if input_size is None:
+            input_size = material.size
+ block_count = ((input_size + self.get_block_size() - 1) //
+ (self.get_block_size()))
+ if input_size != block_count * self.get_block_size():
+ if block_count > 1:
+ self.main_function.prepared_call((block_count-1, 1),
+ result.gpudata, material.gpudata, self.get_block_size())
+ self.tail_function.prepared_call((1, 1), result.gpudata,
+ material.gpudata,
+ input_size - (block_count - 1) * self.get_block_size(),
+ (block_count - 1) * self.get_block_size())
+ else:
+ self.main_function.prepared_call((block_count, 1),
+ result.gpudata, material.gpudata,
+ self.get_block_size())
+ # Get final elements from each of blocks into table
+ # Sum of the last block is not needed
+ block_sum_array_gpu = pycuda.gpuarray.zeros([block_count],
+ self.numpy_type)
+ block_sum_array = numpy.zeros(block_count).astype(self.numpy_type)
+ for i in range(block_count-1):
+ index = (i + 1) * self.get_block_size() - 1
+ block_sum_array[i] = result[index:index+1].get()[0]
+ block_sum_array_gpu.set(block_sum_array)
+ # TODO: is it possible use GPUArray per-element kernel here?
+ # Rather not (we are adding different number of values for each slice
+ # of the array, but maybe I am not seeing something)
+ self.finish_function.prepared_call((block_count, 1),
+ result.gpudata, material.gpudata,
+ block_sum_array_gpu.gpudata, input_size)
+
+class ExclusivePrefixKernel(PrefixKernel):
+    def __init__(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+ super(ExclusivePrefixKernel, self).__init__(data_type, scan_operation,
+ neutral, name, keep, options, preamble)
+
+ from pycuda.compiler import SourceModule
+ finish_source = """
+ #define SUM(a, b) (%(scan_operation)s)
+
+ %(preamble)s
+
+ __global__ void finish(%(data_type)s *result,
+ const %(data_type)s *material, const %(data_type)s *partial,
+ const int n) {
+ const int x = 2*blockIdx.x*blockDim.x+threadIdx.x;
+ const int y = 2*blockIdx.x*blockDim.x+threadIdx.x+blockDim.x;
+ if (x < n) {
+ for (int i = 1; i <= blockIdx.x; i++) {
+ // Program produces exclusive scan
+ // We need to add last element from previous block
+ // to get correct sum
+ // It calculates ``inclusive'' sum for internal purposes
+ result[x] = SUM(result[x], material[2*i*blockDim.x-1]);
+ result[x] = SUM(result[x], partial[i-1]);
+ if (y < n) {
+ result[y] = SUM(result[y], material[2*i*blockDim.x-1]);
+ result[y] = SUM(result[y], partial[i-1]);
+ }
+ }
+ }
+ }
+ """ % {
+ 'data_type': data_type,
+ 'scan_operation': scan_operation,
+ 'preamble': preamble,
+ }
+ self.finish_function = SourceModule(finish_source, options=options, keep=keep).get_function('finish')
+ self.finish_function.prepare("PPPi", block=(self.block_size, 1, 1))
+
+    def get_block_size(self):
+        # The exclusive kernel processes two elements per thread
+        return 2 * self.block_size
+
+    def get_main_function(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+        module = get_exclusive_prefix_module(data_type, scan_operation,
+            neutral, name, True, keep, options, preamble)
+        return module.get_function(name)
+
+    def get_tail_function(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+        module = get_exclusive_prefix_module_tail(data_type, scan_operation,
+            neutral, name, True, keep, options, preamble)
+        return module.get_function(name)
+
+class InclusivePrefixKernel(PrefixKernel):
+    def __init__(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+ super(InclusivePrefixKernel, self).__init__(data_type, scan_operation,
+ neutral, name, keep, options, preamble)
+
+ from pycuda.compiler import SourceModule
+ finish_source = """
+ #define SUM(a, b) (%(scan_operation)s)
+
+ %(preamble)s
+
+ __global__ void finish(%(data_type)s *result,
+ const %(data_type)s *material, const %(data_type)s *partial,
+ const int n) {
+ const int x = blockIdx.x*blockDim.x+threadIdx.x;
+ if (x < n) {
+ for (int i = 1; i <= blockIdx.x; i++) {
+ result[x] = SUM(result[x], partial[i-1]);
+ }
+ }
+ }
+ """ % {
+ 'data_type': data_type,
+ 'scan_operation': scan_operation,
+ 'preamble': preamble,
+ }
+ self.finish_function = SourceModule(finish_source, options=options, keep=keep).get_function('finish')
+ self.finish_function.prepare("PPPi", block=(self.block_size, 1, 1))
+
+    def get_block_size(self):
+        # The inclusive kernel processes one element per thread
+        return self.block_size
+
+    def get_main_function(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+        module = get_inclusive_prefix_module(data_type, scan_operation,
+            neutral, name, keep, options, preamble)
+        return module.get_function(name)
+
+    def get_tail_function(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+        module = get_inclusive_prefix_module_tail(data_type, scan_operation,
+            neutral, name, keep, options, preamble)
+        return module.get_function(name)
diff --git a/test/test_gpuarray.py b/test/test_gpuarray.py
index a99136f..e13c3f2 100644
--- a/test/test_gpuarray.py
+++ b/test/test_gpuarray.py
@@ -598,6 +598,114 @@ class TestGPUArray:
assert la.norm(z.get().imag - z.imag.get()) == 0
assert la.norm(z.get().conj() - z.conj().get()) == 0
+ # Test prefix scan
+ def compute_exclusive(self, what):
+ result = numpy.zeros((what.size, ), dtype=what.dtype)
+ prefix = 0
+ result[0] = prefix
+ for i in range(what.size-1):
+ prefix += what[i]
+ result[i+1] = prefix
+ return result
+
+ def compute_inclusive(self, what):
+ result = numpy.zeros((what.size, ), dtype=what.dtype)
+ prefix = what[0]
+ result[0] = prefix
+ for i in range(1, what.size):
+ prefix += what[i]
+ result[i] = prefix
+ return result
+
+ @pycuda.tools.mark_cuda_test
+ def test_exclusive_scan_small(self):
+ from pycuda.reduction import ExclusivePrefixKernel
+ p = ExclusivePrefixKernel('float', 'a+b', '0')
+ size = p.block_size
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+ p(result=b_gpu, material=a_gpu, size=size)
+ b = self.compute_exclusive(a)
+ assert (b - b_gpu.get() < 1e-4).all()
+ @pycuda.tools.mark_cuda_test
+ def test_exclusive_scan_large(self):
+ from pycuda.reduction import ExclusivePrefixKernel
+ p = ExclusivePrefixKernel('float', 'a+b', '0')
+ size = 16*p.block_size
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+ p(result=b_gpu, material=a_gpu)
+ b = self.compute_exclusive(a)
+ assert (b - b_gpu.get() < 1e-3).all()
+ @pycuda.tools.mark_cuda_test
+ def test_exclusive_scan_small_non2(self):
+ from pycuda.reduction import ExclusivePrefixKernel
+ p = ExclusivePrefixKernel('float', 'a+b', '0')
+ size = p.block_size-3
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+ p(b_gpu, a_gpu, size=size)
+ b = self.compute_exclusive(a)
+ assert (b - b_gpu.get() < 1e-4).all()
+ @pycuda.tools.mark_cuda_test
+ def test_exclusive_scan_large_non2(self):
+ from pycuda.reduction import ExclusivePrefixKernel
+ p = ExclusivePrefixKernel('float', 'a+b', '0')
+ size = 16*p.block_size-3
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+ p(b_gpu, a_gpu)
+ b = self.compute_exclusive(a)
+ assert (b - b_gpu.get() < 1e-3).all()
+ @pycuda.tools.mark_cuda_test
+ def test_inclusive_scan_small(self):
+ from pycuda.reduction import InclusivePrefixKernel
+ p = InclusivePrefixKernel('float', 'a+b', '0')
+ size = p.block_size
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+        p(b_gpu, a_gpu, size=size)
+ b = self.compute_inclusive(a)
+ assert (b - b_gpu.get() < 1e-3).all()
+ @pycuda.tools.mark_cuda_test
+ def test_inclusive_scan_large(self):
+ from pycuda.reduction import InclusivePrefixKernel
+ p = InclusivePrefixKernel('float', 'a+b', '0')
+ size = 16*p.block_size
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+        p(b_gpu, a_gpu, size=size)
+ b = self.compute_inclusive(a)
+ assert (b - b_gpu.get() < 1e-2).all()
+ @pycuda.tools.mark_cuda_test
+ def test_inclusive_scan_small_non2(self):
+ from pycuda.reduction import InclusivePrefixKernel
+ p = InclusivePrefixKernel('float', 'a+b', '0')
+ size = p.block_size-3
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+        p(b_gpu, a_gpu, size=size)
+ b = self.compute_inclusive(a)
+ assert (b - b_gpu.get() < 1e-3).all()
+ @pycuda.tools.mark_cuda_test
+ def test_inclusive_scan_large_non2(self):
+ from pycuda.reduction import InclusivePrefixKernel
+ p = InclusivePrefixKernel('float', 'a+b', '0')
+ size = 16*p.block_size-3
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+        p(b_gpu, a_gpu, size=size)
+ b = self.compute_inclusive(a)
+ assert (b - b_gpu.get() < 1e-2).all()
+
if __name__ == "__main__":
# make sure that import failures get reported, instead of skipping the tests.
_______________________________________________ PyCUDA mailing list [email protected] http://lists.tiker.net/listinfo/pycuda
