Thanks a lot!

You are completely right. With these changes the code is ~20% faster.

Thanks,

Fran.

On 05/04/2012, at 23:19, pierre castellani wrote:

Hi Francisco,

Just my two cents on your kernel: I learned (back in my old days
;-) ) that pow should be avoided. Just try:

distance=sqrt((pos_x[i]-x)*(pos_x[i]-x)+(pos_y[i]-y)*(pos_y[i]-y)+ (pos_z[i]-z)*(pos_z[i]-z));

and to limit memory accesses (though the compiler should do this by itself):

tmpx=(pos_x[i]-x);tmpy=(pos_y[i]-y);tmpz=(pos_z[i]-z);
distance=sqrt(tmpx*tmpx+tmpy*tmpy+tmpz*tmpz);

I am really sorry but I can't test it, so if it is stupid feel free to
tell me.

Pierre.
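The two forms are mathematically identical and should agree to floating-point precision. A quick NumPy sanity check (a sketch with made-up array sizes and a made-up target point, not code from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.random((3, 1000)).astype(np.float32)  # pos_x, pos_y, pos_z
x = y = z = np.float32(0.5)

# pow-based form, as in the original kernel
d_pow = np.sqrt((pos[0] - x) ** 2 + (pos[1] - y) ** 2 + (pos[2] - z) ** 2)

# multiply-based form, as suggested above
tx, ty, tz = pos[0] - x, pos[1] - y, pos[2] - z
d_mul = np.sqrt(tx * tx + ty * ty + tz * tz)

assert np.allclose(d_pow, d_mul)
```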

On Thursday, 05 April 2012 at 22:30 +0200, Francisco Villaescusa Navarro
wrote:
Hi all,


I have implemented your suggestions as follows. (In my problem I have
arrays with particle positions (pos_x, pos_y, pos_z), and I want to
compute the distances from that set of particles to another particle
located at (x, y, z), then build a histogram of that distribution of
distances.)


distances_gpu_template = """
__global__ void dis(float *pos_x, float *pos_y, float *pos_z, float x,
float y, float z, int size, int *aux)
{
   unsigned int idx = blockIdx.x*blockDim.x+threadIdx.x;
   unsigned int idy = blockIdx.y*blockDim.y+threadIdx.y;
   unsigned int id = idy*gridDim.x*blockDim.x+idx;
   int i,bin;
   const uint interv = %(interv)s;
   float distance;


   for(i=id;i<size;i+=blockDim.x){
       distance=sqrt(pow(pos_x[i]-x,2)+pow(pos_y[i]-y,2)+pow(pos_z[i]-z,2));
       bin=(int)(distance*interv/sqrt(3.01));
       aux[id*interv+bin]+=1;
   }
}
"""


reduction_gpu_template = """
__global__ void red(int *aux, int *his)
{
   unsigned int idx = blockIdx.x*blockDim.x+threadIdx.x;
   unsigned int idy = blockIdx.y*blockDim.y+threadIdx.y;
   unsigned int id = idy*gridDim.x*blockDim.x+idx;
   const uint interv = %(interv)s;


   for(int i=0;i<512;i++){
       his[id]+=aux[id+interv*i];
   }
}
"""


The code runs and generates the same results as numpy.histogram,
although I can only speed up the calculation by a factor of 10-15,
whereas I expected this factor to be close to 100 for big arrays.


Do you think there could be a better approach to the problem? Would
it be faster to compute the matrix of values with several blocks
instead of the single one I'm using here (my GPU has 512 "cores")?


I have also checked whether doing the reduction on the GPU or on the
CPU changes the timings, and the answer is no: it is a very quick
operation, and since it only has to be done once it doesn't matter
which unit, GPU or CPU, performs it.


Fran.



On 04/04/2012, at 23:07, David Mertens wrote:

On Wed, Apr 4, 2012 at 3:51 PM, Pazzula, Dominic J
<[email protected]> wrote:
       Basically, yes.  Each block calculates its own histogram.
       Each block returns an array with the histogram for that
       block.  On the CPU sum up those “sub” histograms into the
       final.



I'm not so sure that performing the reduction on the CPU is the
right way here. But I could be wrong. If you really want to tackle
the problem, you should try the reduction on both the GPU and the
CPU and benchmark the results.
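For the CPU side, such a benchmark can be as simple as the sketch below (timeit with hypothetical sizes; timing the GPU variant would go through PyCUDA's event API instead, which is not shown here):

```python
import timeit
import numpy as np

n_blocks, n_bins = 512, 100
rng = np.random.default_rng(0)
aux = rng.integers(0, 5, size=(n_blocks, n_bins))

# Time the CPU-side reduction of the per-block sub-histograms.
n_calls = 1000
t = timeit.timeit(lambda: aux.sum(axis=0), number=n_calls)
print(f"CPU reduction: {t / n_calls * 1e6:.1f} us per call")
```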




       From: Francisco Villaescusa Navarro
       [mailto:[email protected]]
       Sent: Wednesday, April 04, 2012 3:49 PM
       To: Pazzula, Dominic J [ICG-IT]
       Cc: 'David Mertens'; 'Francisco Villaescusa Navarro';
       '[email protected]'


       Subject: Re: [PyCUDA] Histograms with PyCUDA




       Thanks a lot for the replies!




       I'm not sure I fully understand what you say, so please let
       me put it in my own words (if I'm wrong, please let me
       know):





       I transfer the array with the numbers I want to bin to the
       GPU. For each element of that array I overwrite it with the
       index of the bin that corresponds to that element, and I
       return that array (containing the integer bin indices) to
       the CPU, where I perform the reduction.



The first half of what you said isn't quite what I proposed. I had
in mind that you would allocate a new set of memory on the device
with size N_blocks x N_bins. You would have to perform atomic
operations on the bin increments, which isn't great for performance
because you could serialize multiple updates on the same bin, but at
least you're distributing those atomic operations across many
processors rather than on a single CPU. Proper bin size is critical
for good performance: if your bins are too big, you'll essentially
end up with serialized updates. If the bins are too small, you'll
allocate far more memory than you need.
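This scheme (one sub-histogram per block, updated with atomicAdd, then reduced) can be mocked up on the CPU like so. This is only a sketch of the data layout: "blocks" here are just array chunks, and np.bincount stands in for the atomic increments:

```python
import numpy as np

n_blocks, n_bins, n = 8, 16, 10_000
rng = np.random.default_rng(3)
bins = rng.integers(0, n_bins, size=n)  # precomputed bin index per element

# Each "block" accumulates its own sub-histogram (atomicAdd on the GPU).
sub = np.zeros((n_blocks, n_bins), dtype=np.int64)
for b, chunk in enumerate(np.array_split(bins, n_blocks)):
    sub[b] = np.bincount(chunk, minlength=n_bins)

# Final reduction: sum the sub-histograms into one histogram.
hist = sub.sum(axis=0)
assert np.array_equal(hist, np.bincount(bins, minlength=n_bins))
```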



       On 04/04/2012, at 22:34, Pazzula, Dominic J wrote:




       Exactly what I was about to propose.  Doing the reduction
       would probably be faster on the CPU. NumPy + MKL would
       thread what is essentially a series of element-wise array
       additions.



Actually, I would argue that the reduction of the binning from
N_blocks x N_bins down to a single histogram of size N_bins would be
very well suited for a parallel implementation, better suited than
the binning operation, and should be much faster on the GPU than the
CPU. It also saves you on data transfer back to the CPU when you're
done.
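The N_blocks x N_bins -> N_bins reduction parallelizes as a tree: halve the number of sub-histograms at each step by adding them pairwise, taking log2(N_blocks) steps instead of N_blocks - 1 serial additions. A NumPy mock-up of the access pattern (a sketch; on the GPU each step would be one stride in shared memory or one kernel launch):

```python
import numpy as np

n_blocks, n_bins = 8, 16  # n_blocks must be a power of two here
rng = np.random.default_rng(4)
sub = rng.integers(0, 10, size=(n_blocks, n_bins))
expected = sub.sum(axis=0)

# Tree reduction: add the upper half onto the lower half, then recurse.
work = sub.copy()
stride = n_blocks // 2
while stride > 0:
    work[:stride] += work[stride:2 * stride]
    stride //= 2

assert np.array_equal(work[0], expected)
```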

David


--
"Debugging is twice as hard as writing the code in the first place.
 Therefore, if you write the code as cleverly as possible, you are,
 by definition, not smart enough to debug it." -- Brian Kernighan



_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda



