Hi Bryan,

http://dev.math.canterbury.ac.nz/home/pub/26/ now has the timing measured with Python's time.time() -- there isn't much difference. The card is a Tesla C2070.

Igor

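A minimal sketch of timing a call "from Python's perspective", as discussed in this thread. The workload below is a pure-Python stand-in, not the worksheet's code; `argmax_with_value` and `wall_time` are hypothetical names used only for illustration. With a GPU call, the timed function would need to block until the result is back on the host (for example by ending with .get()), since kernel launches are asynchronous.

```python
import timeit


def argmax_with_value(values):
    """Stand-in workload: return (max value, position of first max) in one pass."""
    best_pos = 0
    best_val = values[0]
    for pos, val in enumerate(values):
        if val > best_val:
            best_val, best_pos = val, pos
    return best_val, best_pos


def wall_time(fn, repeats=5):
    """Best-of-`repeats` wall-clock time of a single fn() call, in seconds."""
    timer = timeit.Timer(fn)
    # number=1 times one call per repetition; taking the minimum reduces noise
    # from one-off costs such as lazy loading on the first run.
    return min(timer.repeat(repeat=repeats, number=1))


# Hypothetical test data; any indexable sequence of numbers works.
data = [(i * 7919) % 10007 for i in range(100_000)]

result = argmax_with_value(data)
elapsed = wall_time(lambda: argmax_with_value(data))

print("max value, position:", result)
print("best wall time: %.6f s" % elapsed)
```

Taking the best of several repetitions also sidesteps the "run twice" effect mentioned in the thread, where the first call pays for loading the library.
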
On Thu, May 31, 2012 at 3:31 PM, Bryan Catanzaro <[email protected]> wrote:
> Hi Igor -
> I meant that it's more useful to know the execution time of code
> running on the GPU from Python's perspective, since Python is the one
> driving the work, and the execution overheads can be significant.
> What timings do you get when you use timeit rather than CUDA events?
> Also, what GPU are you running on?
>
> - bryan
>
> On Wed, May 30, 2012 at 5:56 PM, Igor <[email protected]> wrote:
>> I've updated the http://dev.math.canterbury.ac.nz/home/pub/26/
>>
>> larger vector, a billion elements.
>>
>> As for returning the value, it's the pair of max value and position we
>> are talking about; thrust returns the position, and I'm now timing the
>> extraction of the value from the GPU array, which didn't change the
>> timing much.
>>
>> ReductionKernel still appears 5 times slower than thrust.
>>
>> Bryan, on the same worksheet the numpy timing is printed as well:
>> argmax is 3 times slower than ReductionKernel.
>>
>> On Thu, May 31, 2012 at 12:08 PM, Andreas Kloeckner
>> <[email protected]> wrote:
>>> On Wed, 30 May 2012 22:13:27 +1200, Igor <[email protected]> wrote:
>>>> Hi Andreas,
>>>> I'm attaching an example for your wiki demonstrating how to find a max
>>>> element position both using ReductionKernel and thrust-nvcc-ctypes.
>>>> The latter doesn't quite work on Windows yet. Should work if you're on
>>>> Linux, just change the FOLDER. There is a live version published on
>>>> my sage server (http://dev.math.canterbury.ac.nz/home/pub/26/ ) --
>>>> there they all work and show a discouraging 5-fold slowdown of
>>>> ReductionKernel as compared to thrust (run twice, as the .so file is
>>>> loaded lazily?). Could you take a look and edit it if necessary?
>>>
>>> Not a fair comparison. The PyCUDA test includes the transfer of the
>>> result to the host. (.get()) Doesn't look like that's the case for
>>> thrust. Also, an 80 MB vector is tiny. At 200 GB/s, that's about 4e-4s,
>>> which is in the vicinity of launch overhead.
>>>
>>> Andreas
>>
>> _______________________________________________
>> PyCUDA mailing list
>> [email protected]
>> http://lists.tiker.net/listinfo/pycuda
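
Andreas's estimate in the quoted thread can be checked with a few lines of arithmetic. This is only a sketch: the 80 MB vector size and the 200 GB/s bandwidth are the figures from his message, and `min_read_time` is a hypothetical helper name.

```python
def min_read_time(n_bytes, bandwidth_bytes_per_s):
    """Lower bound on the time to read n_bytes once at the given bandwidth."""
    return n_bytes / bandwidth_bytes_per_s


vector_bytes = 80e6   # 80 MB vector, figure from the thread
bandwidth = 200e9     # 200 GB/s, figure from the thread

t = min_read_time(vector_bytes, bandwidth)
print("%.1e s" % t)   # prints 4.0e-04 s
```

A reduction has to read every element at least once, so this is the best case for the kernel itself. At that scale, fixed per-call costs can account for a large share of the measured time, which is why a larger vector gives a more meaningful comparison.
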
