On Wed, Apr 4, 2012 at 3:51 PM, Pazzula, Dominic J <[email protected]> wrote:
> Basically, yes. Each block calculates its own histogram. Each block
> returns an array with the histogram for that block. On the CPU sum up
> those "sub" histograms into the final.

I'm not so sure that performing the reduction on the CPU is the right way here. But I could be wrong. If you really want to tackle the problem, you should try the reduction on both the GPU and the CPU and benchmark the results.

> From: Francisco Villaescusa Navarro [mailto:[email protected]]
> Sent: Wednesday, April 04, 2012 3:49 PM
> To: Pazzula, Dominic J [ICG-IT]
> Cc: 'David Mertens'; 'Francisco Villaescusa Navarro'; '[email protected]'
> Subject: Re: [PyCUDA] Histograms with PyCUDA
>
> Thanks a lot for the replies!
>
> I'm not sure I fully understand what you say, so please let me put it in my own words (if I'm wrong please let me know):
>
> I transfer the array with the numbers I want to grid to the GPU. Over each
> element of that array I overwrite the value of the bin that corresponds to
> that element, and I return that array (containing integer numbers
> with the positions of the bins) to the CPU, where I perform the reduction.

The first half of what you said isn't quite what I proposed. I had in mind that you would allocate a new set of memory on the device with size N_blocks x N_bins. You would have to perform atomic operations on the bin increments, which isn't great for performance because you could serialize multiple updates on the same bin, but at least you're distributing those atomic operations across many processors rather than on a single CPU. Proper bin size is critical for good performance: if your bins are too big, you'll essentially end up with serialized updates; if the bins are too small, you'll allocate far more memory than you need.

> On 04/04/2012, at 22:34, Pazzula, Dominic J wrote:
>
> Exactly what I was about to propose.
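[Editor's sketch: the N_blocks x N_bins layout described above can be mocked up on the CPU with NumPy. The block count, bin count, and sample data below are made-up illustration values, not anything from the thread; each "block" here is simulated by binning its own chunk of the input into its own row of the scratch array.]

```python
import numpy as np

# Made-up example data: 100,000 samples to bin.
data = np.random.RandomState(42).uniform(0.0, 1.0, 100000)

n_blocks = 64  # stand-in for the CUDA grid size
n_bins = 32
edges = np.linspace(0.0, 1.0, n_bins + 1)

# Each "block" bins its own chunk of the data into its own row,
# mimicking the N_blocks x N_bins device allocation. On the GPU,
# the per-bin increments within a block would be atomic adds.
sub_hists = np.zeros((n_blocks, n_bins), dtype=np.int64)
for b, chunk in enumerate(np.array_split(data, n_blocks)):
    sub_hists[b], _ = np.histogram(chunk, bins=edges)

# The final reduction is an element-wise sum over the block axis.
final = sub_hists.sum(axis=0)

# Sanity check: identical to binning everything in one pass.
assert np.array_equal(final, np.histogram(data, bins=edges)[0])
```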
> Doing the reduction would probably be faster on the CPU. NumPy + MKL
> would thread what is essentially a series of element-wise array additions.

Actually, I would argue that the reduction of the binning from N_blocks x N_bins down to a single histogram of size N_bins is very well suited to a parallel implementation (better suited than the binning operation itself) and should be much faster on the GPU than on the CPU. It also saves you the data transfer back to the CPU when you're done.

David

--
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." -- Brian Kernighan
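[Editor's sketch: why the reduction parallelizes well. Summing N_blocks rows pairwise halves the row count at each step, which is the tree pattern a GPU reduction kernel follows; each step below would correspond to one kernel launch or synchronization point, with n_bins-wide element-wise additions running in parallel. The array sizes and random contents are made-up illustration values.]

```python
import numpy as np

n_blocks, n_bins = 64, 32  # n_blocks is a power of two for a clean tree
sub_hists = np.random.RandomState(0).randint(
    0, 100, size=(n_blocks, n_bins)).astype(np.int64)

expected = sub_hists.sum(axis=0)

# Tree reduction over the block axis: log2(n_blocks) steps, each one
# an element-wise addition of two halves of the remaining rows.
work = sub_hists.copy()
n = n_blocks
while n > 1:
    half = n // 2
    work[:half] += work[half:n]  # pairwise element-wise adds
    n = half

# Row 0 now holds the final histogram.
assert np.array_equal(work[0], expected)
```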
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
