Hi all,
I have implemented your suggestions. (In my problem I have an array
of positions (pos_x, pos_y, pos_z), and I want to compute the
distances from that set of particles to another one located at
(x, y, z), and make a histogram of that distribution of distances.)
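For reference, this is the same computation in pure NumPy, which the kernels below should reproduce (the array size, bin count, and reference point here are illustrative, not the ones from my actual run):

```python
import numpy as np

rng = np.random.default_rng(0)
size, interv = 10000, 128

# Particle positions in the unit cube, plus one reference point (x, y, z).
pos = rng.random((size, 3)).astype(np.float32)
x, y, z = 0.5, 0.5, 0.5

# Distance of every particle to the reference point.
distances = np.sqrt((pos[:, 0] - x)**2
                    + (pos[:, 1] - y)**2
                    + (pos[:, 2] - z)**2)

# Histogram over [0, sqrt(3.01)), matching the binning range in the kernel.
hist, edges = np.histogram(distances, bins=interv,
                           range=(0.0, np.sqrt(3.01)))
```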
distances_gpu_template = """
__global__ void dis(float *pos_x, float *pos_y, float *pos_z, float x,
                    float y, float z, int size, int *aux)
{
    unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int idy = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int id  = idy*gridDim.x*blockDim.x + idx;
    /* stride over the whole grid, not just one block's width; with a
       single 1-D block this reduces to the original blockDim.x */
    unsigned int nthreads = blockDim.x*gridDim.x * blockDim.y*gridDim.y;
    const uint interv = %(interv)s;
    int i, bin;
    float distance;
    for(i=id; i<size; i+=nthreads){
        float dx = pos_x[i]-x, dy = pos_y[i]-y, dz = pos_z[i]-z;
        /* single-precision sqrtf avoids the double-precision pow/sqrt path */
        distance = sqrtf(dx*dx + dy*dy + dz*dz);
        bin = (int)(distance*interv/sqrtf(3.01f));
        /* each thread increments its own private row of aux, so no atomics */
        aux[id*interv+bin] += 1;
    }
}
"""
reduction_gpu_template = """
__global__ void red(int *aux, int *his)
{
    unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int idy = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int id  = idy*gridDim.x*blockDim.x + idx;
    const uint interv = %(interv)s;
    /* 512 = number of per-thread partial histograms written by dis();
       thread id sums bin `id` across all of them */
    for(int i=0; i<512; i++){
        his[id] += aux[id + interv*i];
    }
}
"""
The code runs and generates the same results as numpy.histogram,
although I can only speed up the calculation by a factor of 10-15,
whereas I expected this factor to be closer to 100 for big arrays.
Do you think there could be a better approach to the problem?
Would it be faster to compute the matrix of values with several
blocks instead of the single one I'm using here (my GPU has 512
"cores")?
I have also checked whether performing the reduction on the GPU or on
the CPU makes any difference, and it does not: the reduction is a
very quick operation, and since it only has to be done once, it
hardly matters which unit performs it.
Fran.
El 04/04/2012, a las 23:07, David Mertens escribió:
On Wed, Apr 4, 2012 at 3:51 PM, Pazzula, Dominic J <[email protected]
> wrote:
Basically, yes. Each block calculates its own histogram and returns
an array containing the histogram for that block. On the CPU, sum
those “sub”-histograms into the final one.
I'm not so sure that performing the reduction on the CPU is the
right way here. But I could be wrong. If you really want to tackle
the problem, you should try the reduction on both the GPU and the
CPU and benchmark the results.
From: Francisco Villaescusa Navarro [mailto:[email protected]
]
Sent: Wednesday, April 04, 2012 3:49 PM
To: Pazzula, Dominic J [ICG-IT]
Cc: 'David Mertens'; 'Francisco Villaescusa Navarro'; '[email protected]
'
Subject: Re: [PyCUDA] Histograms with PyCUDA
Thanks a lot for the replies!
I'm not sure I fully understand what you're saying, so let me put it
in my own words (if I'm wrong, please let me know):
I transfer the array with the numbers I want to bin to the GPU. I
overwrite each element of that array with the index of the bin it
falls into, and return that array (now containing integer bin
positions) to the CPU, where I perform the reduction.
The first half of what you said isn't quite what I proposed. I had
in mind that you would allocate a new set of memory on the device
with size N_blocks x N_bins. You would have to perform atomic
operations on the bin increments, which isn't great for performance
because you could serialize multiple updates on the same bin, but at
least you're distributing those atomic operations across many
processors rather than on a single CPU. Proper bin size is critical
for good performance: if your bins are too big, you'll essentially
end up with serialized updates. If the bins are too small, you'll
allocate far more memory than you need.
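A CPU sketch of that N_blocks x N_bins layout may help (sizes here are illustrative): each block fills its own private row of the partial-histogram matrix, and the reduction is then just a column-wise sum. The atomics David mentions would only be needed among threads sharing a row, not across rows:

```python
import numpy as np

rng = np.random.default_rng(2)
n_blocks, n_bins, size = 8, 32, 4096
R = np.sqrt(3.01)

distances = rng.random(size) * R

# Each "block" histograms its own chunk of the data into a private row,
# so no two blocks ever touch the same memory.
partial = np.zeros((n_blocks, n_bins), dtype=np.int64)
for b, chunk in enumerate(np.array_split(distances, n_blocks)):
    partial[b], _ = np.histogram(chunk, bins=n_bins, range=(0.0, R))

# Reduction: collapse the per-block rows into the final histogram.
final = partial.sum(axis=0)
```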
El 04/04/2012, a las 22:34, Pazzula, Dominic J escribió:
Exactly what I was about to propose. Doing the reduction would
probably be faster on the CPU. NumPy + MKL would thread what is
essentially a series of element-wise array additions.
Actually, I would argue that the reduction of the binning from
N_blocks x N_bins down to a single histogram of size N_bins would be
very well suited for a parallel implementation, better suited than
the binning operation, and should be much faster on the GPU than the
CPU. It also saves you on data transfer back to the CPU when you're
done.
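One way to see why this reduction parallelizes well: a pairwise (tree) reduction collapses 512 partial histograms in log2(512) = 9 steps, and within each step every row-addition is independent, so on the GPU they can all run concurrently. A NumPy simulation of that access pattern (sizes illustrative):

```python
import numpy as np

n_blocks, n_bins = 512, 64
rng = np.random.default_rng(3)
aux = rng.integers(0, 100, size=(n_blocks, n_bins))
expected = aux.sum(axis=0)

# Pairwise (tree) reduction: each step halves the number of live rows
# by adding the upper half of the matrix onto the lower half.
work = aux.copy()
n = n_blocks
while n > 1:
    half = n // 2
    work[:half] += work[half:n]   # all `half` row-additions are independent
    n = half
final = work[0]
```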
David
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." -- Brian Kernighan
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda