Hi all,
I have implemented your suggestions. (In my problem I have an array
of positions (pos_x, pos_y, pos_z), and I want to compute the
distances from that set of particles to another one located at
(x, y, z), and make a histogram of that distribution of distances.)
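For reference, this is the same computation in pure NumPy, which the kernels below should reproduce (the array size, bin count, and reference point here are illustrative, not the ones from my actual run):

```python
import numpy as np

rng = np.random.default_rng(0)
size, interv = 10000, 128

# Particle positions in the unit cube, plus one reference point (x, y, z).
pos = rng.random((size, 3)).astype(np.float32)
x, y, z = 0.5, 0.5, 0.5

# Distance of every particle to the reference point.
distances = np.sqrt((pos[:, 0] - x)**2
                    + (pos[:, 1] - y)**2
                    + (pos[:, 2] - z)**2)

# Histogram over [0, sqrt(3.01)), matching the binning range in the kernel.
hist, edges = np.histogram(distances, bins=interv,
                           range=(0.0, np.sqrt(3.01)))
```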
distances_gpu_template = """
__global__ void dis(float *pos_x, float *pos_y, float *pos_z, float x,
                    float y, float z, int size, int *aux)
{
    unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int idy = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int id  = idy*gridDim.x*blockDim.x + idx;
    /* stride over the whole grid, not just one block's width; with a
       single 1-D block this reduces to the original blockDim.x */
    unsigned int nthreads = blockDim.x*gridDim.x * blockDim.y*gridDim.y;
    const uint interv = %(interv)s;
    int i, bin;
    float distance;
    for(i=id; i<size; i+=nthreads){
        float dx = pos_x[i]-x, dy = pos_y[i]-y, dz = pos_z[i]-z;
        /* single-precision sqrtf avoids the double-precision pow/sqrt path */
        distance = sqrtf(dx*dx + dy*dy + dz*dz);
        bin = (int)(distance*interv/sqrtf(3.01f));
        /* each thread increments its own private row of aux, so no atomics */
        aux[id*interv+bin] += 1;
    }
}
"""
reduction_gpu_template = """
__global__ void red(int *aux, int *his)
{
    unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int idy = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int id  = idy*gridDim.x*blockDim.x + idx;
    const uint interv = %(interv)s;
    /* 512 = number of per-thread partial histograms written by dis();
       thread id sums bin `id` across all of them */
    for(int i=0; i<512; i++){
        his[id] += aux[id + interv*i];
    }
}
"""
The code runs and generates the same results as numpy.histogram,
although I can only speed up the calculation by a factor of 10-15,
whereas I expected this factor to be closer to 100 for big arrays.
Do you think there could be a better approach to the problem?
Would it be faster to compute the matrix of values with several
blocks instead of the single one I'm using here (my GPU has 512
"cores")?
I have also checked whether performing the reduction on the GPU or on
the CPU makes any difference, and it does not: the reduction is a
very quick operation, and since it only has to be done once, it
hardly matters which unit performs it.
Fran.
El 04/04/2012, a las 23:07, David Mertens escribió:
On Wed, Apr 4, 2012 at 3:51 PM, Pazzula, Dominic J <[email protected]
> wrote:
Basically, yes. Each block calculates its own histogram and returns
an array containing the histogram for that block. On the CPU, sum
those “sub”-histograms into the final one.
I'm not so sure that performing the reduction on the CPU is the
right way here. But I could be wrong. If you really want to tackle
the problem, you should try the reduction on both the GPU and the
CPU and benchmark the results.
From: Francisco Villaescusa Navarro [mailto:[email protected]
]
Sent: Wednesday, April 04, 2012 3:49 PM
To: Pazzula, Dominic J [ICG-IT]
Cc: 'David Mertens'; 'Francisco Villaescusa Navarro'; '[email protected]
'
Subject: Re: [PyCUDA] Histograms with PyCUDA
Thanks a lot for the replies!
I'm not sure I fully understand what you're saying, so let me put it
in my own words (if I'm wrong, please let me know):
I transfer the array with the numbers I want to bin to the GPU. I
overwrite each element of that array with the index of the bin it
falls into, and return that array (now containing integer bin
positions) to the CPU, where I perform the reduction.
The first half of what you said isn't quite what I proposed. I had
in mind that you would allocate a new set of memory on the device
with size N_blocks x N_bins. You would have to perform atomic
operations on the bin increments, which isn't great for performance
because you could serialize multiple updates on the same bin, but at
least you're distributing those atomic operations across many
processors rather than on a single CPU. Proper bin size is critical
for good performance: if your bins are too big, you'll essentially
end up with serialized updates. If the bins are too small, you'll
allocate far more memory than you need.
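A CPU sketch of that N_blocks x N_bins layout may help (sizes here are illustrative): each block fills its own private row of the partial-histogram matrix, and the reduction is then just a column-wise sum. The atomics David mentions would only be needed among threads sharing a row, not across rows:

```python
import numpy as np

rng = np.random.default_rng(2)
n_blocks, n_bins, size = 8, 32, 4096
R = np.sqrt(3.01)

distances = rng.random(size) * R

# Each "block" histograms its own chunk of the data into a private row,
# so no two blocks ever touch the same memory.
partial = np.zeros((n_blocks, n_bins), dtype=np.int64)
for b, chunk in enumerate(np.array_split(distances, n_blocks)):
    partial[b], _ = np.histogram(chunk, bins=n_bins, range=(0.0, R))

# Reduction: collapse the per-block rows into the final histogram.
final = partial.sum(axis=0)
```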
El 04/04/2012, a las 22:34, Pazzula, Dominic J escribió:
Exactly what I was about to propose. Doing the reduction would
probably be faster on the CPU. NumPy + MKL would thread what is
essentially a series of element-wise array additions.
Actually, I would argue that the reduction of the binning from
N_blocks x N_bins down to a single histogram of size N_bins would be
very well suited for a parallel implementation, better suited than
the binning operation, and should be much faster on the GPU than the
CPU. It also saves you on data transfer back to the CPU when you're
done.
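One way to see why this reduction parallelizes well: a pairwise (tree) reduction collapses 512 partial histograms in log2(512) = 9 steps, and within each step every row-addition is independent, so on the GPU they can all run concurrently. A NumPy simulation of that access pattern (sizes illustrative):

```python
import numpy as np

n_blocks, n_bins = 512, 64
rng = np.random.default_rng(3)
aux = rng.integers(0, 100, size=(n_blocks, n_bins))
expected = aux.sum(axis=0)

# Pairwise (tree) reduction: each step halves the number of live rows
# by adding the upper half of the matrix onto the lower half.
work = aux.copy()
n = n_blocks
while n > 1:
    half = n // 2
    work[:half] += work[half:n]   # all `half` row-additions are independent
    n = half
final = work[0]
```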
David
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." -- Brian Kernighan
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda