Re: [PyCUDA] Histograms with PyCUDA

Francisco Villaescusa Navarro Wed, 11 Apr 2012 01:56:38 -0700

Hi,

Thanks for checking it.

I'm not sure to fully understand your comment. The matrix temp_grid isa matrix of size (threads, intervals). The thread 0 will create anhistogram and will save it on the first column of that matrix, thethread 1 will create an histogram and will save it on the second rowof that matrix....

Once you have the matrix filled, the reduction part will just sum allthe histograms to produce the final result.

So I think there is no memory conflict between different threads (theelements filled by one thread are never touched by other threads).

Are you saying that accessing those positions of memory every time, inthe same thread is a low step?


Thanks a lot,

Fran.

El 10/04/2012, a las 22:04, Pazzula, Dominic J escribió:

I'm seeing performance of around a 7.5x speed up on my 8500GT and 1year old Xeon. 140ms GPU vs 1047ms CPU on 10,000,000 points.

For me, the slower memory throughput kills the idea of using MKL andnumpy for the reduction. MKL can do the reduction step in lessthan .01ms and the GPU is doing it in .9ms. But the transfer timeof that 2MB (512x1024 floats) is about 2ms, giving the GPU the edge.

I am guessing here, but my thought on the slower than expectedperformance comes from the writing of the bins.


       temp_grid[id*interv+bin]+=1.0;

If I am remembering correctly, the GPU cannot write to the samememory slot simultaneously. If the memory is aligned such thatbin=0 for each core is in the same slot, then if two cores wereattempting to adjust the value in bin=0, it would take 2 cyclesinstead of 1. 10 writing means 10 cycles. Etc. Even if the memoryis not aligned, you will be attempting to write to the same block alarge number of times.

Utilizing shared memory to temporarily hold the bin counts and thenputting them into the global output matrix SHOULD increase yourtime. If I have a chance, I'll work on it and post an updated kernel.


Dominic


-----Original Message-----

From: [email protected] [mailto:[email protected]] OnBehalf Of Francisco Villaescusa Navarro

Sent: Friday, April 06, 2012 11:26 AM
To: Thomas Wiecki
Cc: [email protected]
Subject: Re: [PyCUDA] Histograms with PyCUDA

Thanks for all the suggestions!

Regarding removing sqrt: it seems that the code only gains about ~1%,
and you lose the capacity to easily define linear intervals...

I have tried with sqrt and sqrtf, but there is not difference in the
total time (or it is very small).

The code to find the histogram of an array with values between 0 and 1
should be something as:

import numpy as np
import time
import pycuda.driver as cuda
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath
from pycuda.compiler import SourceModule
from pycuda import compiler

grid_gpu_template = """
__global__ void grid(float *values, int size, float *temp_grid)
{
    unsigned int id = threadIdx.x;
    int i,bin;
    const uint interv = %(interv)s;

    for(i=id;i<size;i+=blockDim.x){
        bin=(int)(values[i]*interv);
        if (bin==interv){
           bin=interv-1;
        }
        temp_grid[id*interv+bin]+=1.0;
    }
}
"""

reduction_gpu_template = """
__global__ void reduction(float *temp_grid, float *his)
{
    unsigned int id = blockIdx.x*blockDim.x+threadIdx.x;
    const uint interv = %(interv)s;
    const uint threads = %(max_number_of_threads)s;

    if(id<interv){
        for(int i=0;i<threads;i++){
            his[id]+=temp_grid[id+interv*i];
        }
    }
}
"""

number_of_points=100000000
max_number_of_threads=512
interv=1024

blocks=interv/max_number_of_threads
if interv%max_number_of_threads!=0:
    blocks+=1

values=np.random.random(number_of_points).astype(np.float32)

grid_gpu = grid_gpu_template % {
    'interv': interv,
}
mod_grid = compiler.SourceModule(grid_gpu)
grid = mod_grid.get_function("grid")

reduction_gpu = reduction_gpu_template % {
    'interv': interv,
    'max_number_of_threads': max_number_of_threads,
}
mod_redt = compiler.SourceModule(reduction_gpu)
redt = mod_redt.get_function("reduction")

values_gpu=gpuarray.to_gpu(values)
temp_grid_gpu
=gpuarray.zeros((max_number_of_threads,interv),dtype=np.float32)
hist=np.zeros(interv,dtype=np.float32)
hist_gpu=gpuarray.to_gpu(hist)

start=time.clock()*1e3
grid
(values_gpu
,np
.int32
(number_of_points
),temp_grid_gpu,grid=(1,1),block=(max_number_of_threads,1,1))
redt(temp_grid_gpu,hist_gpu,grid=(blocks,
1),block=(max_number_of_threads,1,1))
hist=hist_gpu.get()
print 'Time used to grid with GPU:',time.clock()*1e3-start,' ms'


start=time.clock()*1e3
bins_histo=np.linspace(0.0,1.0,interv+1)
hist_CPU=np.histogram(values,bins=bins_histo)[0]
print 'Time used to grid with CPU:',time.clock()*1e3-start,' ms'

print 'max difference between methods=',np.max(hist_CPU-hist)


################

Results:

Time used to grid with GPU: 680.0  ms
Time used to grid with CPU: 9320.0  ms
max difference between methods= 0.0

So it seems that with this algorithm we can't achieve factors larger
than ~15

Fran.



_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda



_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda

Re: [PyCUDA] Histograms with PyCUDA

Reply via email to