On Tue, Apr 25, 2017 at 3:49 PM, archana sapkota <[email protected]> wrote:
> Hello,
> I just started working with PyCUDA. Basically, all of CUDA is new to me. I was
> trying to use the GPU to compute dot products of a large number of
> vectors, because it was taking several days using multiple CPU cores.
>
> But with my first try, I am sad that I did not see the boost in speed. Here
> is a piece of code that I am currently running. This is just to see how much
> speedup I will be getting. My vector of interest has a dimension of around
> "3000". So eventually I will be computing the dot product (or L2 norm) of those
> vectors.
>
> I would highly appreciate it if someone could suggest what I am missing and how
> I could achieve my goal.
>
> I also see some difference in results between numpy and the GPU. Not as big a
> concern right now, but I am curious why.
>
> Here is the sample code I'm working with:
>
> import pycuda.gpuarray as gpuarray
> import pycuda.reduction as reduction
> import pycuda.driver as cuda
> import pycuda.autoinit
> from pycuda.compiler import SourceModule
> import numpy
> import time
>
>
> krnl = reduction.ReductionKernel(numpy.float32, neutral="0",
>         reduce_expr="a+b", map_expr="x[i]*y[i]",
>         arguments="float *x, float *y")
> ssd = reduction.ReductionKernel(numpy.float32, neutral="0",
>         reduce_expr="a+b", map_expr="(x[i] - y[i])*(x[i] - y[i])",
>         arguments="float *x, float *y")
>
> for i in range(10):
>     a = numpy.random.randn(3000)
>     b = numpy.random.randn(3000)
>
>     a_gpu = gpuarray.to_gpu(a.astype(numpy.float32))
>     b_gpu = gpuarray.to_gpu(b.astype(numpy.float32))
>
>     start = time.time()
>     numpy_dot = numpy.dot(a, b)
>     end = time.time()
>     dt = end - start
>     # Sum of squared differences, computed outside the timed region
>     numpy_ssd = numpy.sum((a - b) ** 2)
>
>     print("CPU time", dt)
>     print("numpy_dot", numpy_dot)
>     print("numpy_ssd", numpy_ssd)
>
>     start = time.time()
>     my_dot_prod = krnl(a_gpu, b_gpu).get()
>     end = time.time()
>     dt = end - start
>     # Same quantity on the GPU, also outside the timed region
>     my_ssd = ssd(a_gpu, b_gpu).get()
>
>     print("GPU time", dt)
>     print("my dot product", my_dot_prod)
>     print("my ssd", my_ssd)
>     print("\n")
>
>
> Example timings are:
> CPU time 5.9604644775390625e-06
> numpy_dot -19.7736554062
> numpy_ssd 5975.41368065
> GPU time 0.0009388923645019531
> my dot product -19.77365493774414
> my ssd 5975.4140625
>
>
> Thanks,
> Arch

Several points:

- The first invocation of a kernel will be slower than subsequent
  invocations because of the time taken to compile the kernel.
- Owing to the relatively low bandwidth of GPU to host memory transfers,
  you will probably not see any overall speedup for relatively short vectors
  such as those you are processing if you are loading a new vector into GPU
  memory at every iteration. You will probably see better performance
  processing your vectors in parallel on the CPU using something like
  Python's multiprocessing module or dask distributed
  (https://distributed.readthedocs.io/en/latest/).
- Since you are using single-precision floats on the GPU, you will see small
  differences between the CUDA and numpy results because of internal
  implementation differences, such as the order in which the terms are
  summed. Note also that your numpy reference is computed in double
  precision, because the arrays are cast to float32 only for the GPU copy.

--
Lev E. Givon, PhD
http://lebedov.github.io

_______________________________________________
PyCUDA mailing list
[email protected]
https://lists.tiker.net/listinfo/pycuda
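
To illustrate the second point above, here is a minimal sketch of processing many vectors per call rather than one pair per iteration, so that per-call overhead is amortized over the whole batch. It assumes the vectors can be stacked as rows of two 2-D arrays, and it uses plain numpy on the CPU rather than PyCUDA. The names batched_dot and batched_ssd and the batch size of 1000 are hypothetical, chosen only for illustration. The same batching idea should carry over to the GPU (one transfer for many vectors instead of one per vector), but that variant is not shown here.

```python
import numpy as np


def batched_dot(x, y):
    """Row-wise dot products: out[k] = sum_i x[k, i] * y[k, i]."""
    return np.einsum("ij,ij->i", x, y)


def batched_ssd(x, y):
    """Row-wise sum of squared differences between x and y."""
    d = x - y
    return np.einsum("ij,ij->i", d, d)


# Hypothetical batch: 1000 pairs of 3000-dimensional vectors.
rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 3000))
y = rng.standard_normal((1000, 3000))

dots = batched_dot(x, y)  # shape (1000,)
ssds = batched_ssd(x, y)  # shape (1000,)

# Spot-check one pair against the per-vector computation from the original post.
assert np.isclose(dots[0], np.dot(x[0], y[0]))
assert np.isclose(ssds[0], np.sum((x[0] - y[0]) ** 2))
```

Whether this is fast enough depends on how many pairs there are and how they are stored. If a single process is still too slow, the batches themselves could be distributed with multiprocessing or dask, as suggested above.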
