Hi Francisco,
Just my 2 cents on your kernel: I've learned that pow should be avoided (in
my old days ;-) ). Just try:
distance=sqrt((pos_x[i]-x)*(pos_x[i]-x)+(pos_y[i]-y)*(pos_y[i]-y)+(pos_z[i]-z)*(pos_z[i]-z));
and, to limit memory accesses (though the compiler should do this by itself):
tmpx=(pos_x[i]-x);tmpy=(pos_y[i]-y);tmpz=(pos_z[i]-z);
distance=sqrt(tmpx*tmpx+tmpy*tmpy+tmpz*tmpz);
I'm really sorry, but I can't test it, so if it is a silly suggestion, feel
free to tell me.
Pierre.
On Thursday, 05 April 2012 at 22:30 +0200, Francisco Villaescusa Navarro
wrote:
> Hi all,
>
>
> I have implemented your suggestions by writing the code below (in my
> problem I have an array of positions (pos_x, pos_y, pos_z), and I want to
> compute the distances from that set of particles to another one located at
> (x, y, z), and make a histogram of that distribution of distances):
>
>
> distances_gpu_template = """
> __global__ void dis(float *pos_x, float *pos_y, float *pos_z, float x,
>                     float y, float z, int size, int *aux)
> {
>     unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
>     unsigned int idy = blockIdx.y*blockDim.y + threadIdx.y;
>     unsigned int id  = idy*gridDim.x*blockDim.x + idx;
>     int i, bin;
>     const uint interv = %(interv)s;
>     float distance;
>
>     for(i=id; i<size; i+=blockDim.x){
>         distance = sqrt(pow(pos_x[i]-x,2) + pow(pos_y[i]-y,2) + pow(pos_z[i]-z,2));
>         bin = (int)(distance*interv/sqrt(3.01));
>         aux[id*interv+bin] += 1;
>     }
> }
> """
>
>
> reduction_gpu_template = """
> __global__ void red(int *aux, int *his)
> {
>     unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
>     unsigned int idy = blockIdx.y*blockDim.y + threadIdx.y;
>     unsigned int id  = idy*gridDim.x*blockDim.x + idx;
>     const uint interv = %(interv)s;
>
>     for(int i=0; i<512; i++){
>         his[id] += aux[id + interv*i];
>     }
> }
> """
>
>
> The code runs and generates the same results as numpy.histogram, although
> I can only speed up the calculation by a factor of 10-15, whereas I
> expected this factor to be closer to 100 for big arrays.
>
>
> Do you think there could be a better approach to the problem? Would it be
> faster to compute the matrix of values with several blocks instead of the
> single one I'm using here (my GPU has 512 "cores")?
>
>
> I have also checked whether doing the reduction on the GPU or on the CPU
> makes any difference, and the answer is no: it is a very quick operation,
> and since it only has to be done once, it does not matter which unit, GPU
> or CPU, is used.
>
>
> Fran.
>
>
>
> On 04/04/2012, at 23:07, David Mertens wrote:
>
> > On Wed, Apr 4, 2012 at 3:51 PM, Pazzula, Dominic J
> > <[email protected]> wrote:
> > Basically, yes. Each block calculates its own histogram.
> > Each block returns an array with the histogram for that
> > block. On the CPU, sum up those “sub” histograms into the
> > final one.
> >
> >
> >
> > I'm not so sure that performing the reduction on the CPU is the
> > right way here. But I could be wrong. If you really want to tackle
> > the problem, you should try the reduction on both the GPU and the
> > CPU and benchmark the results.
> >
> >
> >
> >
> > From: Francisco Villaescusa Navarro
> > [mailto:[email protected]]
> > Sent: Wednesday, April 04, 2012 3:49 PM
> > To: Pazzula, Dominic J [ICG-IT]
> > Cc: 'David Mertens'; 'Francisco Villaescusa Navarro';
> > '[email protected]'
> >
> >
> > Subject: Re: [PyCUDA] Histograms with PyCUDA
> >
> >
> >
> >
> > Thanks a lot for the replies!
> >
> >
> >
> >
> > I'm not sure I fully understand what you're saying, so
> > please let me restate it in my own words (if I'm wrong,
> > please let me know):
> >
> >
> >
> >
> >
> > I transfer the array with the numbers I want to grid to the
> > GPU. I overwrite each element of that array with the index
> > of the bin that corresponds to that element, and I return
> > that array (containing integers with the bin positions) to
> > the CPU, where I do the reduction.
> >
> >
> >
> > The first half of what you said isn't quite what I proposed. I had
> > in mind that you would allocate a new set of memory on the device
> > with size N_blocks x N_bins. You would have to perform atomic
> > operations on the bin increments, which isn't great for performance
> > because you could serialize multiple updates on the same bin, but at
> > least you're distributing those atomic operations across many
> > processors rather than on a single CPU. Proper bin size is critical
> > for good performance: if your bins are too big, you'll essentially
> > end up with serialized updates. If the bins are too small, you'll
> > allocate far more memory than you need.
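> >
> > A rough sketch of the kind of kernel I have in mind (untested, written
> > in the same template style as yours; the binning line is copied from
> > your version, and the per-block layout means aux needs
> > gridDim.x*interv entries and a card with atomicAdd support):
> >
> > blocked_histogram_template = """
> > __global__ void dis_blocked(float *pos_x, float *pos_y, float *pos_z,
> >                             float x, float y, float z, int size, int *aux)
> > {
> >     const uint interv = %(interv)s;
> >     int *block_hist = aux + blockIdx.x*interv;   // this block's sub-histogram
> >     unsigned int id     = blockIdx.x*blockDim.x + threadIdx.x;
> >     unsigned int stride = gridDim.x*blockDim.x;
> >
> >     for (int i = id; i < size; i += stride) {
> >         float dx = pos_x[i]-x, dy = pos_y[i]-y, dz = pos_z[i]-z;
> >         float distance = sqrt(dx*dx + dy*dy + dz*dz);
> >         int bin = (int)(distance*interv/sqrt(3.01));
> >         // several threads of the block may hit the same bin, hence the atomic
> >         atomicAdd(&block_hist[bin], 1);
> >     }
> > }
> > """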
> >
> >
> >
> > On 04/04/2012, at 22:34, Pazzula, Dominic J wrote:
> >
> >
> >
> >
> > Exactly what I was about to propose. Doing the reduction
> > would probably be faster on the CPU. NumPy + MKL would
> > thread what is essentially a series of element-wise array
> > additions.
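> >
> > In NumPy that reduction is essentially a one-liner, e.g. (assuming the
> > sub-histograms come back as a flat int32 array aux of n_blocks*n_bins
> > entries; the names are only illustrative):
> >
> > his = aux.reshape(n_blocks, n_bins).sum(axis=0)  # element-wise sum of the sub-histograms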
> >
> >
> >
> > Actually, I would argue that the reduction of the binning from
> > N_blocks x N_bins down to a single histogram of size N_bins would be
> > very well suited for a parallel implementation, better suited than
> > the binning operation, and should be much faster on the GPU than the
> > CPU. It also saves you on data transfer back to the CPU when you're
> > done.
> >
> > David
> >
> >
> > --
> > "Debugging is twice as hard as writing the code in the first place.
> > Therefore, if you write the code as cleverly as possible, you are,
> > by definition, not smart enough to debug it." -- Brian Kernighan
> >
>
>
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda