On Sun, 2010-12-26 at 17:42 +0100, Andreas Kloeckner wrote:
> Hi Tomasz,
>
> On Tue, 21 Dec 2010 23:21:59 +0100, Tomasz Rybak <[email protected]> wrote:
> > I have written the attached code that calculates a parallel prefix sum.
> > There are two variants - the first one (exclusive) is based on the
> > article by Mark Harris from NVIDIA, the second one (inclusive) is based
> > on a diagram from Wikipedia.
> > The inclusive scan could be optimised with regard to shared memory
> > access conflicts, similarly to the exclusive version.
> > It also seems that the inclusive scan is less stable numerically -
> > results differ from the CPU version when calculating the scan of
> > large arrays.
>
> Again, thanks for your contribution! Here are a few comments on the
> scan:
>
> - It doesn't seem like the inclusive and the exclusive version are so
>   dissimilar. As such, I don't think we should duplicate code for the
>   two. If necessary, I'd even prefer to make this code depend on Mako or
>   Jinja (template engines) to avoid code duplication.
I have fixed the code repetition and added class inheritance. I am not
sure about joining the exclusive and inclusive scans - they use different
algorithms. For now I have added the functions and classes to
pycuda.reduction - feel free to move them to another module.

> - Use warnings.warn, not plain print, for warnings.

Done.

> - Tests should go into tests/test_gpu_array or some such.

Done; I have also added documentation (see patch).

> - Formal nitpicks: Please indent comments with the rest of the code.
>   PEP 8 says a=value (no spaces) for keyword arguments. Camel case in C
>   is yucky, too. :)

Done, sorry.

> - 'Sum' is poor wording for the general associative operation that the
>   scan uses--use scan_op perhaps.

Done.

--
Tomasz Rybak <[email protected]> GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak
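To make the "they use different algorithms" point concrete, here is a CPU-side Python sketch of mine (not part of the patch) of the two scans the kernels implement: the work-efficient up-sweep/down-sweep from the Harris whitepaper (exclusive) and the Hillis-Steele passes from the Wikipedia diagram (inclusive). Power-of-2 sizes are assumed, as in the main kernels:

```python
# CPU-side sketch of the two scan algorithms used in the patch.
# Both assume len(data) is a power of 2, like the main GPU kernels.

def exclusive_scan(data, op=lambda a, b: a + b, neutral=0):
    """Work-efficient (Blelloch) exclusive scan: up-sweep, then down-sweep."""
    a = list(data)
    n = len(a)
    # Up-sweep: build partial sums at the internal nodes of a binary tree
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            a[i + 2 * d - 1] = op(a[i + 2 * d - 1], a[i + d - 1])
        d *= 2
    # Down-sweep: clear the root, then push prefixes back down the tree
    a[n - 1] = neutral
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):
            t = a[i + d - 1]
            a[i + d - 1] = a[i + 2 * d - 1]
            a[i + 2 * d - 1] = op(a[i + 2 * d - 1], t)
        d //= 2
    return a

def inclusive_scan(data, op=lambda a, b: a + b):
    """Hillis-Steele inclusive scan: log2(n) passes, each pass adding
    the element `d` positions back."""
    a = list(data)
    n = len(a)
    d = 1
    while d < n:
        prev = list(a)  # snapshot emulates the __syncthreads() barrier
        for i in range(d, n):
            a[i] = op(prev[i], prev[i - d])
        d *= 2
    return a

data = [3, 1, 7, 0, 4, 1, 6, 3]
print(exclusive_scan(data))  # [0, 3, 4, 11, 11, 15, 16, 22]
print(inclusive_scan(data))  # [3, 4, 11, 11, 15, 16, 22, 25]
```

Note that Hillis-Steele performs O(n log n) applications of the operator versus O(n) for the up-sweep/down-sweep, which may be related to the numerical drift I see on large float arrays.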
diff --git a/doc/source/array.rst b/doc/source/array.rst
index 5273df9..4bd9c92 100644
--- a/doc/source/array.rst
+++ b/doc/source/array.rst
@@ -448,6 +448,83 @@ Here's a usage example::
my_dot_prod = krnl(a, b).get()
+Parallel Prefix Scan
+--------------------
+
+These functions and classes live in the module :mod:`pycuda.reduction`.
+
+.. function:: get_exclusive_prefix_module(data_type, scan_operation, neutral, name='prefix_kernel', optimise=False, keep=False, options=[], preamble='')
+
+   Generate a module containing a function named *name*. The function
+   accepts two pointers of type *data_type* and the array size *n*. The
+   first pointer points to the memory where the result will be stored,
+   the second points to the input array. The function assumes that the
+   input and output arrays have the same size and that this size is a
+   power of 2. It performs an exclusive parallel prefix scan, applying
+   *scan_operation* to all elements. If the scan operation is addition
+   (scan_operation='a+b'), the input array {a0, a1, a2, ..., aN} is
+   transformed into {0, a0, a0+a1, a0+a1+a2, ...}. *neutral* is the
+   value that is neutral for the particular *scan_operation*, e.g. 0 for
+   addition, 1 for multiplication. *optimise* determines whether to use
+   more shared memory to avoid bank conflicts.
+
+.. function:: get_exclusive_prefix_module_tail(data_type, scan_operation, neutral, name='prefix_kernel', optimise=False, keep=False, options=[], preamble='')
+
+   Like *get_exclusive_prefix_module*, but accepts arrays whose size is
+   not a power of 2.
+
+.. function:: get_inclusive_prefix_module(data_type, scan_operation, neutral, name='prefix_kernel', keep=False, options=[], preamble='')
+
+   Generate a module containing a function named *name*. The function
+   accepts two pointers of type *data_type* and the array size *n*. The
+   first pointer points to the memory where the result will be stored,
+   the second points to the input array. The function assumes that the
+   input and output arrays have the same size and that this size is a
+   power of 2. It performs an inclusive parallel prefix scan, applying
+   *scan_operation* to all elements. If the scan operation is addition
+   (scan_operation='a+b'), the input array {a0, a1, a2, ..., aN} is
+   transformed into {a0, a0+a1, a0+a1+a2, ..., a0+a1+a2+...+aN}.
+   *neutral* is the value that is neutral for the particular
+   *scan_operation*, e.g. 0 for addition, 1 for multiplication. The
+   returned function uses a different algorithm than
+   *get_exclusive_prefix_module*.
+
+.. function:: get_inclusive_prefix_module_tail(data_type, scan_operation, neutral, name='prefix_kernel', keep=False, options=[], preamble='')
+
+   Like *get_inclusive_prefix_module*, but accepts arrays whose size is
+   not a power of 2.
+
+.. class:: ExclusivePrefixKernel(data_type, scan_operation, neutral, name='prefix_kernel', keep=False, options=[], preamble='')
+
+   Generates kernels that perform an exclusive prefix scan on a
+   :class:`GPUArray`. The generated kernels apply *scan_operation* to
+   arrays of *data_type* elements, with *neutral* as the neutral element.
+
+   .. method:: __call__(result, material[, size])
+
+      Invoke the generated kernels. *result* and *material* must be
+      instances of :class:`GPUArray`. *size* is optional; if it is
+      omitted, the operation is performed on the entire array, otherwise
+      only on the first *size* elements.
+
+.. class:: InclusivePrefixKernel(data_type, scan_operation, neutral, name='prefix_kernel', keep=False, options=[], preamble='')
+
+   Generates kernels that perform an inclusive prefix scan on a
+   :class:`GPUArray`. The generated kernels apply *scan_operation* to
+   arrays of *data_type* elements, with *neutral* as the neutral element.
+
+   .. method:: __call__(result, material[, size])
+
+      Invoke the generated kernels. *result* and *material* must be
+      instances of :class:`GPUArray`. *size* is optional; if it is
+      omitted, the operation is performed on the entire array, otherwise
+      only on the first *size* elements.
+
+
+Here's a usage example::
+
+ a = gpuarray.arange(400, dtype=numpy.float32)
+ b = gpuarray.arange(400, dtype=numpy.float32)
+
+ krnl = InclusivePrefixKernel('float', 'a+b', '0')
+
+ krnl(a, b)
+
Fast Fourier Transforms
-----------------------
diff --git a/pycuda/reduction.py b/pycuda/reduction.py
index 474997f..d349e48 100644
--- a/pycuda/reduction.py
+++ b/pycuda/reduction.py
@@ -385,3 +385,444 @@ def get_subset_minmax_kernel(what, dtype):
"const %(tp)s *in" % {
"tp": dtype_to_ctype(dtype),
}, preamble="#define MY_INFINITY (1./0)")
+
+
+# Parallel prefix scan
+
+exclusive_scan_source = """
+ // Travel upwards, from leaves to the root of the tree
+ // During each loop time distance between nodes increases two times,
+ // and number of nodes decreases two times
+ for (int number_of_nodes = full_array_size>>1; number_of_nodes > 0;
+ number_of_nodes >>= 1) {
+ __syncthreads();
+
+ if (threadIdx.x < number_of_nodes) {
+ int source_node_index = tree_node_distance*(2*threadIdx.x+1)-1;
+ int target_node_index = tree_node_distance*(2*threadIdx.x+2)-1;
+ source_node_index += SHARED_BANK_PADDING(source_node_index);
+ target_node_index += SHARED_BANK_PADDING(target_node_index);
+
+ shared_array[target_node_index] =
+ SUM(shared_array[target_node_index],
+ shared_array[source_node_index]);
+ }
+ tree_node_distance <<= 1;
+ }
+
+ if (threadIdx.x == 0) {
+ shared_array[full_array_size-1 + SHARED_BANK_PADDING(full_array_size-1)] = %(neutral)s;
+ }
+
+ // Travel downwards, from root to the leaves
+ // During each loop number of nodes increases two times and
+ // distance between nodes decreases two times
+ for (int number_of_nodes = 1; number_of_nodes < full_array_size;
+ number_of_nodes <<= 1) {
+ tree_node_distance >>= 1;
+ __syncthreads();
+
+ if (threadIdx.x < number_of_nodes) {
+ int source_node_index = tree_node_distance*(2*threadIdx.x+1)-1;
+ int target_node_index = tree_node_distance*(2*threadIdx.x+2)-1;
+ source_node_index += SHARED_BANK_PADDING(source_node_index);
+ target_node_index += SHARED_BANK_PADDING(target_node_index);
+
+ %(data_type)s temp = shared_array[source_node_index];
+ shared_array[source_node_index] = shared_array[target_node_index];
+ shared_array[target_node_index] =
+ SUM(shared_array[target_node_index], temp);
+ }
+ }
+ __syncthreads();
+"""
+
+scan_main_head = """
+ #define SUM(a, b) (%(scan_operation)s)
+
+ %(preamble)s
+
+ __global__ void %(name)s(%(data_type)s *result,
+ const %(data_type)s *material, const int n) {
+ extern __shared__ %(data_type)s shared_array[];
+
+ int source_node_index = threadIdx.x;
+"""
+
+scan_tail_head = """
+ #define SUM(a, b) (%(scan_operation)s)
+
+ %(preamble)s
+
+ __global__ void %(name)s(%(data_type)s *result,
+ const %(data_type)s *material, const int n, const int block_offset) {
+ extern __shared__ %(data_type)s shared_array[];
+
+ int source_node_index = threadIdx.x;
+"""
+
+# Exclusive version, after the Mark Harris article
+def get_exclusive_prefix_module(data_type, scan_operation, neutral,
+        name='prefix_kernel', optimise=False, keep=False, options=[],
+        preamble=''):
+    """
+    Prefix-sum implementation based on the whitepaper
+    "Parallel Prefix Sum (Scan) with CUDA" by Mark Harris (NVIDIA).
+    """
+ from pycuda.compiler import SourceModule
+ if optimise:
+ preamble = preamble+"\n#define SHARED_BANK_PADDING(n) ((n) >> 4)\n"
+ else:
+ preamble = preamble+"\n#define SHARED_BANK_PADDING(n) 0\n"
+ src = (scan_main_head +
+ """
+ // Size of shared array that can contain input data
+ const int full_array_size = n;
+ const int block_offset = 2*blockIdx.x*blockDim.x;
+ int tree_node_distance = 1;
+
+ int target_node_index = threadIdx.x + (n/2);
+ shared_array[source_node_index + SHARED_BANK_PADDING(source_node_index)] =
+ material[source_node_index + block_offset];
+ shared_array[target_node_index + SHARED_BANK_PADDING(target_node_index)] =
+ material[target_node_index + block_offset];
+ """ +
+ exclusive_scan_source +
+ """
+ result[source_node_index + block_offset] =
+ shared_array[source_node_index +
+ SHARED_BANK_PADDING(source_node_index)];
+ result[target_node_index + block_offset] =
+ shared_array[target_node_index +
+ SHARED_BANK_PADDING(target_node_index)];
+ }
+ """) % {
+ 'data_type': data_type,
+ 'name': name,
+ 'neutral': neutral,
+ 'scan_operation': scan_operation,
+ 'preamble': preamble,
+ }
+ return SourceModule(src, options=options, keep=keep)
+
+def get_exclusive_prefix_module_tail(data_type, scan_operation, neutral,
+        name='prefix_kernel', optimise=False, keep=False, options=[],
+        preamble=''):
+    """
+    Prefix-sum implementation based on the whitepaper
+    "Parallel Prefix Sum (Scan) with CUDA" by Mark Harris (NVIDIA).
+
+    Can be used for arrays whose size is not a power of 2.
+    Expects the shared memory size to be larger than the input data size.
+    """
+ from pycuda.compiler import SourceModule
+ if optimise:
+ preamble = preamble+"\n#define SHARED_BANK_PADDING(n) ((n) >> 4)\n"
+ else:
+ preamble = preamble+"\n#define SHARED_BANK_PADDING(n) 0\n"
+ src = (scan_tail_head +
+ """
+ // Size of shared array that can contain input data
+ // The smallest array of size power-of-2 that can contain original array
+ const int full_array_size = 1<<((int)(floor(log2((float)n))+1));
+ int tree_node_distance = 1;
+
+ int target_node_index = threadIdx.x + (n/2);
+ if (source_node_index < n) {
+ shared_array[source_node_index + SHARED_BANK_PADDING(source_node_index)] = material[source_node_index + block_offset];
+ } else {
+ shared_array[source_node_index + SHARED_BANK_PADDING(source_node_index)] = %(neutral)s;
+ }
+ if (target_node_index < n) {
+ shared_array[target_node_index + SHARED_BANK_PADDING(target_node_index)] = material[target_node_index + block_offset];
+ } else {
+ shared_array[target_node_index + SHARED_BANK_PADDING(target_node_index)] = %(neutral)s;
+ }
+ """ +
+ exclusive_scan_source +
+ """
+ if (source_node_index < n) { result[source_node_index + block_offset] = shared_array[source_node_index + SHARED_BANK_PADDING(source_node_index)]; }
+ if (target_node_index < n) { result[target_node_index + block_offset] = shared_array[target_node_index + SHARED_BANK_PADDING(target_node_index)]; }
+ }
+ """) % {
+ 'data_type': data_type,
+ 'name': name,
+ 'neutral': neutral,
+ 'scan_operation': scan_operation,
+ 'preamble': preamble,
+ }
+ return SourceModule(src, options=options, keep=keep)
+
+# Inclusive version, after http://en.wikipedia.org/wiki/Prefix_sum
+
+inclusive_scan_source = """
+ // During each loop time distance between nodes increases two times,
+ // and number of nodes decreases two times
+ for (int tree_node_distance = 1; tree_node_distance < n; tree_node_distance <<= 1) {
+ // Ensure that all values are in an array
+ __syncthreads();
+
+ int target_node_index = source_node_index + tree_node_distance;
+ %(data_type)s partial;
+ if (target_node_index < n) {
+ partial = SUM(shared_array[target_node_index],
+ shared_array[source_node_index]);
+ }
+
+ // Ensure that all sums are in registers
+ __syncthreads();
+
+ // Write to an array
+ if (target_node_index < n) {
+ shared_array[target_node_index] = partial;
+ }
+ }
+
+ __syncthreads();
+"""
+
+# This function does not use "neutral", but it keeps the same argument
+# list as the other functions
+def get_inclusive_prefix_module(data_type, scan_operation, neutral,
+        name='prefix_kernel', keep=False, options=[], preamble=''):
+    """
+    Prefix-sum implementation based on the Wikipedia "Prefix sum" article.
+    """
+ from pycuda.compiler import SourceModule
+ src = (scan_main_head +
+ """
+ const int block_offset = blockIdx.x*blockDim.x;
+ shared_array[source_node_index] = material[source_node_index + block_offset];
+ """ +
+ inclusive_scan_source +
+ """
+ result[source_node_index + block_offset] = shared_array[source_node_index];
+ }
+ """) % {
+ 'data_type': data_type,
+ 'name': name,
+ 'scan_operation': scan_operation,
+ 'preamble': preamble,
+ }
+ return SourceModule(src, options=options, keep=keep)
+
+def get_inclusive_prefix_module_tail(data_type, scan_operation, neutral,
+        name='prefix_kernel', keep=False, options=[], preamble=''):
+    """
+    Prefix-sum implementation based on the Wikipedia "Prefix sum" article.
+
+    Can be used for arrays whose size is not a power of 2.
+    """
+ from pycuda.compiler import SourceModule
+ src = (scan_tail_head +
+ """
+ if (source_node_index < n) {
+ shared_array[source_node_index] = material[source_node_index + block_offset];
+ } else {
+ shared_array[source_node_index] = %(neutral)s;
+ }
+ """ +
+ inclusive_scan_source +
+ """
+ if (source_node_index < n) {
+ result[source_node_index + block_offset] = shared_array[source_node_index];
+ }
+ }
+ """) % {
+ 'data_type': data_type,
+ 'name': name,
+ 'scan_operation': scan_operation,
+ 'neutral': neutral,
+ 'preamble': preamble,
+ }
+ return SourceModule(src, options=options, keep=keep)
+
+class PrefixKernel(object):
+    def __init__(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+ # Determine the size of used data type
+ self.numpy_type = pycuda.tools.parse_c_arg("const %s * in" %
+ data_type).dtype
+ item_size = self.numpy_type.itemsize
+
+ # Determine the number of threads
+ dev = pycuda.driver.Context.get_device()
+ block_size = dev.get_attribute(pycuda.driver.device_attribute.MAX_THREADS_PER_BLOCK)
+ block_dimension = dev.get_attribute(pycuda.driver.device_attribute.MAX_BLOCK_DIM_X)
+ self.block_size = min(block_size, block_dimension)
+
+        # Shared memory size: number of array elements processed per
+        # block (see get_block_size) times the size of one data item
+        self.shared_size = self.get_block_size()*item_size
+ # Padding to avoid bank conflicts
+ log_num_banks = 4
+ self.shared_size += ((self.get_block_size() >> log_num_banks) *
+ item_size)
+
+ # Check whether we do not exceed available shared memory size
+ max_shared_size = dev.get_attribute(pycuda.driver.device_attribute.MAX_SHARED_MEMORY_PER_BLOCK)
+ if self.shared_size > max_shared_size:
+ import warnings
+            warnings.warn("Reducing block and shared memory size to fit the device limit")
+ self.shared_size = max_shared_size
+ self.block_size = self.shared_size//(2*item_size)
+
+ self.main_function = self.get_main_function(data_type, scan_operation,
+ neutral, name, keep, options, preamble)
+ self.main_function.prepare("PPi", block=(self.block_size, 1, 1),
+ shared=self.shared_size)
+ self.tail_function = self.get_tail_function(data_type, scan_operation,
+ neutral, name, keep, options, preamble)
+ self.tail_function.prepare("PPii", block=(self.block_size, 1, 1),
+ shared=self.shared_size)
+ # Use maximum available shared memory in 2.x devices
+ # TODO: is it needed as we are more limited by number of threads?
+ # Might be needed for large data types (?)
+ if dev.compute_capability() >= (2, 0):
+ cache_size = pycuda.driver.func_cache.PREFER_SHARED
+ self.main_function.set_cache_config(cache_size)
+ self.tail_function.set_cache_config(cache_size)
+
+    def get_block_size(self):
+        raise NotImplementedError
+
+    def get_main_function(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+        raise NotImplementedError
+
+    def get_tail_function(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+        raise NotImplementedError
+
+    def __call__(self, *args, **kwargs):
+        # Accept (result, material[, size]) positionally or as keywords
+        result = kwargs.get('result', args[0] if len(args) > 0 else None)
+        material = kwargs.get('material', args[1] if len(args) > 1 else None)
+        # Size of the data to operate on; defaults to the size of the
+        # input array
+        input_size = kwargs.get('size', args[2] if len(args) > 2 else None)
+        if input_size is None:
+            input_size = material.size
+ block_count = ((input_size + self.get_block_size() - 1) //
+ (self.get_block_size()))
+ if input_size != block_count * self.get_block_size():
+ if block_count > 1:
+ self.main_function.prepared_call((block_count-1, 1),
+ result.gpudata, material.gpudata, self.get_block_size())
+ self.tail_function.prepared_call((1, 1), result.gpudata,
+ material.gpudata,
+ input_size - (block_count - 1) * self.get_block_size(),
+ (block_count - 1) * self.get_block_size())
+ else:
+ self.main_function.prepared_call((block_count, 1),
+ result.gpudata, material.gpudata,
+ self.get_block_size())
+ # Get final elements from each of blocks into table
+ # Sum of the last block is not needed
+ block_sum_array_gpu = pycuda.gpuarray.zeros([block_count],
+ self.numpy_type)
+ block_sum_array = numpy.zeros(block_count).astype(self.numpy_type)
+ for i in range(block_count-1):
+ index = (i + 1) * self.get_block_size() - 1
+ block_sum_array[i] = result[index:index+1].get()[0]
+ block_sum_array_gpu.set(block_sum_array)
+ # TODO: is it possible use GPUArray per-element kernel here?
+ # Rather not (we are adding different number of values for each slice
+ # of the array, but maybe I am not seeing something)
+ self.finish_function.prepared_call((block_count, 1),
+ result.gpudata, material.gpudata,
+ block_sum_array_gpu.gpudata, input_size)
+
+class ExclusivePrefixKernel(PrefixKernel):
+    def __init__(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+ super(ExclusivePrefixKernel, self).__init__(data_type, scan_operation,
+ neutral, name, keep, options, preamble)
+
+ from pycuda.compiler import SourceModule
+ finish_source = """
+ #define SUM(a, b) (%(scan_operation)s)
+
+ %(preamble)s
+
+ __global__ void finish(%(data_type)s *result,
+ const %(data_type)s *material, const %(data_type)s *partial,
+ const int n) {
+ const int x = 2*blockIdx.x*blockDim.x+threadIdx.x;
+ const int y = 2*blockIdx.x*blockDim.x+threadIdx.x+blockDim.x;
+ if (x < n) {
+ for (int i = 1; i <= blockIdx.x; i++) {
+ // Program produces exclusive scan
+ // We need to add last element from previous block
+ // to get correct sum
+ // It calculates ``inclusive'' sum for internal purposes
+ result[x] = SUM(result[x], material[2*i*blockDim.x-1]);
+ result[x] = SUM(result[x], partial[i-1]);
+ if (y < n) {
+ result[y] = SUM(result[y], material[2*i*blockDim.x-1]);
+ result[y] = SUM(result[y], partial[i-1]);
+ }
+ }
+ }
+ }
+ """ % {
+ 'data_type': data_type,
+ 'scan_operation': scan_operation,
+ 'preamble': preamble,
+ }
+ self.finish_function = SourceModule(finish_source, options=options, keep=keep).get_function('finish')
+ self.finish_function.prepare("PPPi", block=(self.block_size, 1, 1))
+
+    def get_block_size(self):
+        # The exclusive kernel processes two elements per thread
+        return 2 * self.block_size
+
+    def get_main_function(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+        module = get_exclusive_prefix_module(data_type, scan_operation,
+            neutral, name, True, keep, options, preamble)
+        return module.get_function(name)
+
+    def get_tail_function(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+        module = get_exclusive_prefix_module_tail(data_type, scan_operation,
+            neutral, name, True, keep, options, preamble)
+        return module.get_function(name)
+
+class InclusivePrefixKernel(PrefixKernel):
+    def __init__(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+ super(InclusivePrefixKernel, self).__init__(data_type, scan_operation,
+ neutral, name, keep, options, preamble)
+
+ from pycuda.compiler import SourceModule
+ finish_source = """
+ #define SUM(a, b) (%(scan_operation)s)
+
+ %(preamble)s
+
+ __global__ void finish(%(data_type)s *result,
+ const %(data_type)s *material, const %(data_type)s *partial,
+ const int n) {
+ const int x = blockIdx.x*blockDim.x+threadIdx.x;
+ if (x < n) {
+ for (int i = 1; i <= blockIdx.x; i++) {
+ result[x] = SUM(result[x], partial[i-1]);
+ }
+ }
+ }
+ """ % {
+ 'data_type': data_type,
+ 'scan_operation': scan_operation,
+ 'preamble': preamble,
+ }
+ self.finish_function = SourceModule(finish_source, options=options, keep=keep).get_function('finish')
+ self.finish_function.prepare("PPPi", block=(self.block_size, 1, 1))
+
+    def get_block_size(self):
+        # The inclusive kernel processes one element per thread
+        return self.block_size
+
+    def get_main_function(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+        module = get_inclusive_prefix_module(data_type, scan_operation,
+            neutral, name, keep, options, preamble)
+        return module.get_function(name)
+
+    def get_tail_function(self, data_type, scan_operation, neutral,
+            name='prefix_kernel', keep=False, options=[], preamble=''):
+        module = get_inclusive_prefix_module_tail(data_type, scan_operation,
+            neutral, name, keep, options, preamble)
+        return module.get_function(name)
diff --git a/test/test_gpuarray.py b/test/test_gpuarray.py
index a99136f..e13c3f2 100644
--- a/test/test_gpuarray.py
+++ b/test/test_gpuarray.py
@@ -598,6 +598,114 @@ class TestGPUArray:
assert la.norm(z.get().imag - z.imag.get()) == 0
assert la.norm(z.get().conj() - z.conj().get()) == 0
+ # Test prefix scan
+ def compute_exclusive(self, what):
+ result = numpy.zeros((what.size, ), dtype=what.dtype)
+ prefix = 0
+ result[0] = prefix
+ for i in range(what.size-1):
+ prefix += what[i]
+ result[i+1] = prefix
+ return result
+
+ def compute_inclusive(self, what):
+ result = numpy.zeros((what.size, ), dtype=what.dtype)
+ prefix = what[0]
+ result[0] = prefix
+ for i in range(1, what.size):
+ prefix += what[i]
+ result[i] = prefix
+ return result
+
+ @pycuda.tools.mark_cuda_test
+ def test_exclusive_scan_small(self):
+ from pycuda.reduction import ExclusivePrefixKernel
+ p = ExclusivePrefixKernel('float', 'a+b', '0')
+ size = p.block_size
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+ p(result=b_gpu, material=a_gpu, size=size)
+ b = self.compute_exclusive(a)
+ assert (b - b_gpu.get() < 1e-4).all()
+ @pycuda.tools.mark_cuda_test
+ def test_exclusive_scan_large(self):
+ from pycuda.reduction import ExclusivePrefixKernel
+ p = ExclusivePrefixKernel('float', 'a+b', '0')
+ size = 16*p.block_size
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+ p(result=b_gpu, material=a_gpu)
+ b = self.compute_exclusive(a)
+ assert (b - b_gpu.get() < 1e-3).all()
+ @pycuda.tools.mark_cuda_test
+ def test_exclusive_scan_small_non2(self):
+ from pycuda.reduction import ExclusivePrefixKernel
+ p = ExclusivePrefixKernel('float', 'a+b', '0')
+ size = p.block_size-3
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+ p(b_gpu, a_gpu, size=size)
+ b = self.compute_exclusive(a)
+ assert (b - b_gpu.get() < 1e-4).all()
+ @pycuda.tools.mark_cuda_test
+ def test_exclusive_scan_large_non2(self):
+ from pycuda.reduction import ExclusivePrefixKernel
+ p = ExclusivePrefixKernel('float', 'a+b', '0')
+ size = 16*p.block_size-3
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+ p(b_gpu, a_gpu)
+ b = self.compute_exclusive(a)
+ assert (b - b_gpu.get() < 1e-3).all()
+ @pycuda.tools.mark_cuda_test
+ def test_inclusive_scan_small(self):
+ from pycuda.reduction import InclusivePrefixKernel
+ p = InclusivePrefixKernel('float', 'a+b', '0')
+ size = p.block_size
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+        p(b_gpu, a_gpu, size=size)
+ b = self.compute_inclusive(a)
+ assert (b - b_gpu.get() < 1e-3).all()
+ @pycuda.tools.mark_cuda_test
+ def test_inclusive_scan_large(self):
+ from pycuda.reduction import InclusivePrefixKernel
+ p = InclusivePrefixKernel('float', 'a+b', '0')
+ size = 16*p.block_size
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+        p(b_gpu, a_gpu, size=size)
+ b = self.compute_inclusive(a)
+ assert (b - b_gpu.get() < 1e-2).all()
+ @pycuda.tools.mark_cuda_test
+ def test_inclusive_scan_small_non2(self):
+ from pycuda.reduction import InclusivePrefixKernel
+ p = InclusivePrefixKernel('float', 'a+b', '0')
+ size = p.block_size-3
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+        p(b_gpu, a_gpu, size=size)
+ b = self.compute_inclusive(a)
+ assert (b - b_gpu.get() < 1e-3).all()
+ @pycuda.tools.mark_cuda_test
+ def test_inclusive_scan_large_non2(self):
+ from pycuda.reduction import InclusivePrefixKernel
+ p = InclusivePrefixKernel('float', 'a+b', '0')
+ size = 16*p.block_size-3
+ a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32)
+ b_gpu = pycuda.gpuarray.zeros([size], numpy.float32)
+ a = a_gpu.get()
+        p(b_gpu, a_gpu, size=size)
+ b = self.compute_inclusive(a)
+ assert (b - b_gpu.get() < 1e-2).all()
+
if __name__ == "__main__":
# make sure that import failures get reported, instead of skipping the tests.
_______________________________________________ PyCUDA mailing list [email protected] http://lists.tiker.net/listinfo/pycuda
