Re: [PyCUDA] Compiling thrust code in pyCUDA

Andreas Kloeckner Wed, 30 May 2012 23:59:58 -0700

On Thu, 31 May 2012 12:56:15 +1200, Igor wrote:
> I've updated the http://dev.math.canterbury.ac.nz/home/pub/26/
> 
> larger vector, a billion elements.
> 
> As for returning the value, it's the pair of max value and position we
> are talking about, thrust returns the position and I'm now timing the
> extraction of the value from the gpu array which didn't change timing
> too much.
> 
> ReductionKernel still appears 5 times slower than thrust.
> 
> Bryan, on the same worksheet the numpy timing is printed as well:
> argmax is 3 times slower than ReductionKernel.


I've looked at this for a little while, but I can't quite make heads or
tails of it yet. Here's the profiler output:

method=[ reduce_kernel_stage1 ] gputime=[ 20617.984 ] cputime=[ 20647.000 ]
    gridsize=[ 128, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 1.000 ]
    l1_shared_bank_conflict=[ 672 ] inst_issued=[ 7906011 ]
method=[ reduce_kernel_stage2 ] gputime=[ 9.696 ] cputime=[ 29.000 ]
    gridsize=[ 1, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 0.333 ]
    l1_shared_bank_conflict=[ 96 ]

method=[ _ZN6thrust<snip>] gputime=[ 3556.736 ] cputime=[ 3583.000 ]
    gridsize=[ 32, 1 ] threadblocksize=[ 768, 1, 1 ] occupancy=[ 1.000 ]
    l1_shared_bank_conflict=[ 1255 ] inst_issued=[ 2964333 ]
method=[ _ZN6thrust6<snip>] gputime=[ 8.640 ] cputime=[ 30.000 ]
    gridsize=[ 1, 1 ] threadblocksize=[ 32, 1, 1 ] occupancy=[ 0.021 ]
    l1_shared_bank_conflict=[ 18 ]

The second stages are comparable, but PyCUDA takes a sound beating in the
first stage: about 20.6 ms against thrust's 3.6 ms, a factor of roughly
5.8. I don't quite understand why, though. Code-wise, PyCUDA and thrust
do mostly the same thing. Some parameters are different, but I've
twiddled them, and they don't make a big difference. From the profile,
the main killer seems to be that thrust's code simply issues about 2.7
times fewer instructions (2,964,333 vs. 7,906,011). But I don't get
why, since the codes aren't that different.
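
For anyone who wants to redo the arithmetic on the first-stage numbers,
here is a minimal sketch in plain Python. The two log lines are copied
from the profiler output above with only the numeric counters of interest
kept; the helper name parse_counters and the dictionary keys are made up
for illustration and are not part of PyCUDA or of the profiler.

```python
import re

# First-stage kernel records, trimmed to the counters compared here.
PROFILE = {
    "pycuda": "method=[ reduce_kernel_stage1 ] gputime=[ 20617.984 ] "
              "inst_issued=[ 7906011 ]",
    "thrust": "method=[ _ZN6thrust<snip>] gputime=[ 3556.736 ] "
              "inst_issued=[ 2964333 ]",
}


def parse_counters(line):
    """Return the purely numeric `name=[ value ]` fields of one log line."""
    pairs = re.findall(r"(\w+)=\[\s*([0-9.]+)\s*\]", line)
    return {name: float(value) for name, value in pairs}


pycuda = parse_counters(PROFILE["pycuda"])
thrust = parse_counters(PROFILE["thrust"])

# How much longer the PyCUDA kernel runs, and how many more
# instructions it issues, relative to the thrust kernel.
time_ratio = pycuda["gputime"] / thrust["gputime"]
inst_ratio = pycuda["inst_issued"] / thrust["inst_issued"]

# Whatever part of the time gap the instruction count does not cover.
residual = time_ratio / inst_ratio

print("gputime ratio:     %.2f" % time_ratio)
print("inst_issued ratio: %.2f" % inst_ratio)
print("residual factor:   %.2f" % residual)
```

Read this way, the instruction count covers a bit under half of the gap
on a log scale; the rest would have to come from instructions being
issued more slowly, assuming the inst_issued counter is comparable
between the two runs.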

Compare for yourself:

https://code.google.com/p/thrust/source/browse/thrust/system/detail/generic/extrema.inl
https://code.google.com/p/thrust/source/browse/thrust/system/cuda/detail/reduce.inl
https://code.google.com/p/thrust/source/browse/thrust/system/cuda/detail/block/reduce.h

vs

https://github.com/inducer/pycuda/blob/b28595eb92345f561096e833062f11b896013d47/pycuda/reduction.py

I've even made a version of the reduction that mirrors what thrust does
more directly:

https://github.com/inducer/pycuda/blob/thrusty-reduce/pycuda/reduction.py

The timing is about the same, if anything a tad slower. I'd much
appreciate any clues. Igor, can you please check whether the perf
difference is the same for just a simple sum of floats?
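
In case it helps, a rough sketch of the PyCUDA side of such a
sum-of-floats test follows. It is only a sketch: the function names,
the vector size and the repeat count are arbitrary choices made for
illustration, and the thrust side is not shown. Timing uses CUDA events
around repeated launches, after one warm-up call so that kernel
compilation is not counted.

```python
def effective_bandwidth_gb_s(n_elements, itemsize, elapsed_ms):
    """Input bytes read per second, in GB/s, for one pass over the data."""
    return n_elements * itemsize / (elapsed_ms * 1e-3) / 1e9


def time_sum_of_floats(n=1 << 24, repeats=20):
    """Average time in ms of a float32 sum via ReductionKernel (needs a GPU)."""
    # Imported lazily so the helper above stays usable without CUDA.
    import numpy as np
    import pycuda.autoinit  # noqa: F401 (creates a context on the default device)
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray
    from pycuda.reduction import ReductionKernel

    # Plain sum: identity map, addition as the reduction operator.
    sum_kernel = ReductionKernel(
        np.float32, neutral="0", reduce_expr="a+b",
        map_expr="x[i]", arguments="const float *x")

    x = gpuarray.to_gpu(np.random.rand(n).astype(np.float32))
    sum_kernel(x).get()  # warm-up run: triggers compilation and caching

    start, stop = drv.Event(), drv.Event()
    start.record()
    for _ in range(repeats):
        result = sum_kernel(x)
    stop.record()
    stop.synchronize()

    elapsed_ms = start.time_till(stop) / repeats
    return elapsed_ms, float(result.get())
```

On a machine with a GPU, time_sum_of_floats() returns the average time
per reduction in milliseconds and the computed sum; feeding that time
into effective_bandwidth_gb_s(n, 4, ms) gives a figure that can be
compared directly against the same measurement for thrust.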

Andreas


