Do you mind posting the final code here for future reference (as a gist perhaps)?
Also, another optimization might be to remove the (slow) sqrt() in each distance calculation and then take sqrt() of the bin labels in the reduction step.

On Fri, Apr 6, 2012 at 3:56 AM, Francisco Villaescusa Navarro <[email protected]> wrote:

> Thanks a lot!
>
> You are completely right. With these changes the code is ~20% faster.
>
> Thanks,
>
> Fran.
>
> On 05/04/2012, at 23:19, pierre castellani wrote:
>
>> Hi Francisco,
>>
>> Just my two cents on your kernel: I learned (back in my old days ;-) )
>> that pow() should be avoided. Just try:
>>
>> distance=sqrt((pos_x[i]-x)*(pos_x[i]-x)+(pos_y[i]-y)*(pos_y[i]-y)+(pos_z[i]-z)*(pos_z[i]-z));
>>
>> and, to limit memory accesses (though the compiler should do this by
>> itself):
>>
>> tmpx=(pos_x[i]-x); tmpy=(pos_y[i]-y); tmpz=(pos_z[i]-z);
>> distance=sqrt(tmpx*tmpx+tmpy*tmpy+tmpz*tmpz);
>>
>> I am really sorry but I can't test it, so if it is stupid feel free to
>> tell me.
>>
>> Pierre.
>>
>> On Thursday, 05 April 2012 at 22:30 +0200, Francisco Villaescusa
>> Navarro wrote:
>>>
>>> Hi all,
>>>
>>> I have implemented your suggestions. In my problem I have an array of
>>> positions (pos_x, pos_y, pos_z), and I want to compute the distances
>>> from that set of particles to another one located at (x,y,z), and
>>> make a histogram of that distribution of distances:
>>>
>>> distances_gpu_template = """
>>> __global__ void dis(float *pos_x, float *pos_y, float *pos_z, float x,
>>>                     float y, float z, int size, int *aux)
>>> {
>>>     unsigned int idx = blockIdx.x*blockDim.x+threadIdx.x;
>>>     unsigned int idy = blockIdx.y*blockDim.y+threadIdx.y;
>>>     unsigned int id = idy*gridDim.x*blockDim.x+idx;
>>>     int i,bin;
>>>     const uint interv = %(interv)s;
>>>     float distance;
>>>
>>>     for(i=id;i<size;i+=blockDim.x){
>>>         distance=sqrt(pow(pos_x[i]-x,2)+pow(pos_y[i]-y,2)+pow(pos_z[i]-z,2));
>>>         bin=(int)(distance*interv/sqrt(3.01));
>>>         aux[id*interv+bin]+=1;
>>>     }
>>> }
>>> """
>>>
>>> reduction_gpu_template = """
>>> __global__ void red(int *aux, int *his)
>>> {
>>>     unsigned int idx = blockIdx.x*blockDim.x+threadIdx.x;
>>>     unsigned int idy = blockIdx.y*blockDim.y+threadIdx.y;
>>>     unsigned int id = idy*gridDim.x*blockDim.x+idx;
>>>     const uint interv = %(interv)s;
>>>
>>>     for(int i=0;i<512;i++){
>>>         his[id]+=aux[id+interv*i];
>>>     }
>>> }
>>> """
>>>
>>> The code runs and generates the same results as numpy.histogram,
>>> although I can only speed up the calculation by a factor of 10-15,
>>> whereas I expected this factor to be close to 100 for big arrays.
>>>
>>> Do you think there could be a better approach to the problem? Would
>>> it be faster to compute the matrix of values with several blocks
>>> instead of the single one I'm using here (my GPU has 512 "cores")?
>>>
>>> I have also checked whether doing the reduction on the GPU or on the
>>> CPU changes the results: it does not, since it is a very quick
>>> operation, and as it only has to be done once, it doesn't matter
>>> which unit, GPU or CPU, is used.
>>>
>>> Fran.
>>>
>>> On 04/04/2012, at 23:07, David Mertens wrote:
>>>
>>>> On Wed, Apr 4, 2012 at 3:51 PM, Pazzula, Dominic J
>>>> <[email protected]> wrote:
>>>>
>>>> Basically, yes. Each block calculates its own histogram. Each block
>>>> returns an array with the histogram for that block. On the CPU, sum
>>>> up those "sub" histograms into the final one.
>>>>
>>>> I'm not so sure that performing the reduction on the CPU is the
>>>> right way here. But I could be wrong. If you really want to tackle
>>>> the problem, you should try the reduction on both the GPU and the
>>>> CPU and benchmark the results.
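[Editor's note: the two-kernel scheme quoted above — each of the 512 threads filling its own private row of `aux`, then a reduction summing those rows — can be mimicked on the CPU to check it against numpy.histogram. The NumPy sketch below assumes a single 1-D block of 512 threads (so the kernel's `blockDim.x` stride equals the thread count) and uses illustrative sizes; it is not the thread's actual test code.]

```python
import numpy as np

rng = np.random.default_rng(1)
n_threads = 512   # one block of 512 threads, as in the post
interv = 64       # number of bins (stand-in for %(interv)s)
size = 100000

# Precomputed distances, as the "dis" kernel would produce them.
distance = rng.random(size) * np.sqrt(3.0)
edges = np.linspace(0.0, np.sqrt(3.01), interv + 1)

# "dis" kernel: thread t handles elements t, t+512, t+1024, ...
# and increments bins in its own private row aux[t].
aux = np.zeros((n_threads, interv), dtype=np.int32)
for t in range(n_threads):
    bins = (distance[t::n_threads] * interv / np.sqrt(3.01)).astype(np.int32)
    np.add.at(aux[t], bins, 1)   # like aux[id*interv+bin] += 1

# "red" kernel: bin b of the final histogram is the sum of
# column b over all 512 per-thread rows.
his = aux.sum(axis=0)

expected, _ = np.histogram(distance, bins=edges)
assert (his == expected).all()
```

Seen this way, the per-thread rows never collide, which is why the kernel needs no atomics — at the cost of a 512 x interv scratch buffer.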
>>>>
>>>> From: Francisco Villaescusa Navarro
>>>> [mailto:[email protected]]
>>>> Sent: Wednesday, April 04, 2012 3:49 PM
>>>> To: Pazzula, Dominic J [ICG-IT]
>>>> Cc: 'David Mertens'; 'Francisco Villaescusa Navarro';
>>>> '[email protected]'
>>>> Subject: Re: [PyCUDA] Histograms with PyCUDA
>>>>
>>>> Thanks a lot for the replies!
>>>>
>>>> I'm not sure I fully understand what you are saying, so please let
>>>> me put it in my own words (if I'm wrong, let me know):
>>>>
>>>> I transfer the array with the numbers I want to grid to the GPU.
>>>> Over each element of that array I overwrite the value of the bin
>>>> that corresponds to that element, and I return that array
>>>> (containing integers with the positions of the bins) to the CPU,
>>>> where I make the reduction.
>>>>
>>>> The first half of what you said isn't quite what I proposed. I had
>>>> in mind that you would allocate a new set of memory on the device
>>>> with size N_blocks x N_bins. You would have to perform atomic
>>>> operations on the bin increments, which isn't great for performance
>>>> because you could serialize multiple updates on the same bin, but at
>>>> least you're distributing those atomic operations across many
>>>> processors rather than on a single CPU. Proper bin size is critical
>>>> for good performance: if your bins are too big, you'll essentially
>>>> end up with serialized updates. If the bins are too small, you'll
>>>> allocate far more memory than you need.
>>>>
>>>> On 04/04/2012, at 22:34, Pazzula, Dominic J wrote:
>>>>
>>>> Exactly what I was about to propose. Doing the reduction would
>>>> probably be faster on the CPU. NumPy + MKL would thread what is
>>>> essentially a series of element-wise array additions.
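[Editor's note: the N_blocks x N_bins proposal above can be sketched in NumPy, with `np.add.at` standing in for the device-side atomic increment — both handle repeated hits on the same bin without losing updates. Block count, bin count, and data below are made up for illustration.]

```python
import numpy as np

rng = np.random.default_rng(3)
n_blocks, n_bins, size = 16, 32, 50000
values = rng.random(size)                    # data to histogram, in [0, 1)
edges = np.linspace(0.0, 1.0, n_bins + 1)

# One sub-histogram row per block. Within a block, several threads may
# hit the same bin at once, hence atomicAdd on the GPU; np.add.at is the
# sequential analogue, accumulating duplicate indices instead of dropping them.
sub = np.zeros((n_blocks, n_bins), dtype=np.int64)
for b in range(n_blocks):
    bins = (values[b::n_blocks] * n_bins).astype(np.int64)
    np.add.at(sub[b], bins, 1)

final = sub.sum(axis=0)                      # the reduction step
expected, _ = np.histogram(values, bins=edges)
assert (final == expected).all()
```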
>>>>
>>>> Actually, I would argue that the reduction of the binning from
>>>> N_blocks x N_bins down to a single histogram of size N_bins is very
>>>> well suited to a parallel implementation, better suited than the
>>>> binning operation, and should be much faster on the GPU than on the
>>>> CPU. It also saves you the data transfer back to the CPU when you're
>>>> done.
>>>>
>>>> David
>>>>
>>>> --
>>>> "Debugging is twice as hard as writing the code in the first place.
>>>> Therefore, if you write the code as cleverly as possible, you are,
>>>> by definition, not smart enough to debug it." -- Brian Kernighan

_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
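[Editor's note: the GPU-friendly reduction David describes — collapsing N_blocks x N_bins sub-histograms into one histogram — is usually written as a tree: each step halves the number of surviving rows, so 64 rows collapse in log2(64) = 6 steps, with every bin of every surviving row updated in parallel. A NumPy sketch with made-up sizes:]

```python
import numpy as np

rng = np.random.default_rng(2)
n_blocks, n_bins = 64, 32            # n_blocks a power of two
sub = rng.integers(0, 100, size=(n_blocks, n_bins))  # per-block sub-histograms

# Tree reduction: at each step, row i absorbs row i + stride.
# On the GPU each (row, bin) pair in the surviving half is one thread,
# so the whole reduction takes log2(n_blocks) synchronized steps.
work = sub.copy()
stride = n_blocks // 2
while stride >= 1:
    work[:stride] += work[stride:2 * stride]
    stride //= 2

histogram = work[0]                  # final result lands in row 0
assert (histogram == sub.sum(axis=0)).all()
```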

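[Editor's note: the sqrt() elimination suggested in the newest message at the top of the thread — compute only squared distances in the kernel and account for the sqrt() in the binning — can be sanity-checked on the CPU by squaring the bin edges instead. A NumPy sketch with illustrative data:]

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.random((10000, 3))         # particle positions in [0, 1)^3
ref = np.array([0.5, 0.5, 0.5])      # the reference point (x, y, z)
interv = 100
edges = np.linspace(0.0, np.sqrt(3.01), interv + 1)

# Straightforward version: one sqrt per particle.
d2 = ((pos - ref) ** 2).sum(axis=1)
hist_sqrt, _ = np.histogram(np.sqrt(d2), bins=edges)

# sqrt-free version: bin squared distances against squared edges.
# Squaring is monotonic on non-negative values, so the bins agree,
# and interv+1 edge squarings replace 10000 per-particle sqrts.
hist_nosqrt, _ = np.histogram(d2, bins=edges ** 2)

assert (hist_sqrt == hist_nosqrt).all()
```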