On Thu, 31 May 2012 12:56:15 +1200, Igor <[email protected]> wrote:
> I've updated the http://dev.math.canterbury.ac.nz/home/pub/26/
>
> larger vector, a billion elements.
>
> As for returning the value, it's the pair of max value and position we
> are talking about, thrust returns the position and I'm now timing the
> extraction of the value from the gpu array which didn't change timing
> too much.
>
> ReductionKernel still appears 5 times slower than thrust.
>
> Bryan, on the same worksheet the numpy timing is printed as well:
> argmax is 3 times slower than ReductionKernel.

I've looked at this for a little while and can't quite make heads or
tails of it yet. Here's the profiler output:

  method=[ reduce_kernel_stage1 ] gputime=[ 20617.984 ] cputime=[20647.000 ] gridsize=[ 128, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 1.000 ] l1_shared_bank_conflict=[ 672 ] inst_issued=[ 7906011 ]
  method=[ reduce_kernel_stage2 ] gputime=[ 9.696 ] cputime=[ 29.000 ] gridsize=[ 1, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 0.333 ] l1_shared_bank_conflict=[ 96 ]
  method=[ _ZN6thrust<snip>] gputime=[ 3556.736 ] cputime=[ 3583.000 ] gridsize=[ 32, 1 ] threadblocksize=[ 768, 1, 1 ] occupancy=[ 1.000 ] l1_shared_bank_conflict=[ 1255 ] inst_issued=[ 2964333 ]
  method=[ _ZN6thrust6<snip>] gputime=[ 8.640 ] cputime=[ 30.000 ] gridsize=[ 1, 1 ] threadblocksize=[ 32, 1, 1 ] occupancy=[ 0.021 ] l1_shared_bank_conflict=[ 18 ]

The second stages are comparable, but PyCUDA receives a sound beating in
the first stage, and I don't quite understand why. Code-wise, PyCUDA and
thrust do mostly the same thing. Some parameters are different, but I've
twiddled them, and they don't make a big difference. From the profile,
the main killer seems to be that thrust's code simply issues about a
third as many instructions (roughly 3.0M vs. 7.9M inst_issued). But I
don't get why, since the codes aren't that different. Compare for
yourself:

  https://code.google.com/p/thrust/source/browse/thrust/system/detail/generic/extrema.inl
  https://code.google.com/p/thrust/source/browse/thrust/system/cuda/detail/reduce.inl
  https://code.google.com/p/thrust/source/browse/thrust/system/cuda/detail/block/reduce.h

vs.

  https://github.com/inducer/pycuda/blob/b28595eb92345f561096e833062f11b896013d47/pycuda/reduction.py

I've even made a version of the reduction that follows what thrust does
more directly:

  https://github.com/inducer/pycuda/blob/thrusty-reduce/pycuda/reduction.py

The timing is about the same, if anything a tad slower.

I'd much appreciate any clues.

Igor, can you please check whether the perf difference is the same on
just a simple sum of floats?

Andreas
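
As a quick sanity check on the first-stage figures in the profile above, the following minimal sketch computes the PyCUDA-to-thrust ratios for gputime and inst_issued. The numbers are copied from the profiler output. The dictionary layout and the helper name `ratio` are hypothetical, chosen only for illustration.

```python
# First-stage figures copied from the profiler output in this message.
# The dict keys and the helper below are illustrative names, not part
# of PyCUDA, thrust, or the profiler.
profile = {
    "pycuda_stage1": {"gputime": 20617.984, "inst_issued": 7906011},
    "thrust_stage1": {"gputime": 3556.736, "inst_issued": 2964333},
}


def ratio(metric):
    """Return the PyCUDA / thrust ratio for one profiler counter."""
    return profile["pycuda_stage1"][metric] / profile["thrust_stage1"][metric]


time_ratio = ratio("gputime")
inst_ratio = ratio("inst_issued")

print(f"stage 1 gputime ratio:     {time_ratio:.2f}")
print(f"stage 1 inst_issued ratio: {inst_ratio:.2f}")
```

The gputime ratio comes out near 5.8 and the inst_issued ratio near 2.7. If the two counters are directly comparable, the instruction count alone would not seem to account for the whole timing gap, though that reading is only a guess from these four numbers.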
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
