On Tue, Apr 25, 2017 at 3:49 PM, archana sapkota
<[email protected]> wrote:
> Hello,
> I just started working with PyCUDA; CUDA as a whole is new to me. I was
> trying to use the GPU to compute dot products of a large number of
> vectors, because doing so on multiple CPU cores was taking several days.
>
> On my first try, however, I did not see the speedup I was hoping for.
> Below is the code I am currently running, just to see how much speedup I
> will get. My vectors of interest have a dimension of around 3000, and
> eventually I will be computing dot products (or L2 norms) of those
> vectors.
>
> I would highly appreciate it if someone could suggest what I am missing
> and how I could achieve my goal.
>
> I also see some difference between the numpy and GPU results. That is
> not a big concern right now, but I am curious why it happens.
>
> Here is the sample code I am working with:
>
> import pycuda.gpuarray as gpuarray
> import pycuda.reduction as reduction
> import pycuda.driver as cuda
> import pycuda.autoinit
> from pycuda.compiler import SourceModule
> import numpy
> import time
>
>
> krnl = reduction.ReductionKernel(numpy.float32, neutral="0",
>         reduce_expr="a+b", map_expr="x[i]*y[i]",
>         arguments="float *x, float *y")
> ssd = reduction.ReductionKernel(numpy.float32, neutral="0",
>         reduce_expr="a+b", map_expr="(x[i] - y[i])*(x[i] - y[i])",
>         arguments="float *x, float *y")
>
> for i in range(10):
>     a = numpy.random.randn(3000)
>     b = numpy.random.randn(3000)
>
>     a_gpu = gpuarray.to_gpu(a.astype(numpy.float32))
>     b_gpu = gpuarray.to_gpu(b.astype(numpy.float32))
>
>     start = time.time()
>     numpy_dot = numpy.dot(a, b)
>     numpy_ssd = numpy.sum((a - b) ** 2)
>     end = time.time()
>     dt = end - start
>
>     print("CPU time", dt)
>     print("numpy_dot", numpy_dot)
>     print("numpy_ssd", numpy_ssd)
>
>     start = time.time()
>     my_dot_prod = krnl(a_gpu, b_gpu).get()
>     my_ssd = ssd(a_gpu, b_gpu).get()
>     end = time.time()
>     dt = end - start
>
>     print("GPU time", dt)
>     print("my dot product", my_dot_prod)
>     print("my ssd", my_ssd)
>     print("\n")
>
>
> Example timings are:
> CPU time 5.9604644775390625e-06
> numpy_dot -19.7736554062
> numpy_ssd 5975.41368065
> GPU time 0.0009388923645019531
> my dot product -19.77365493774414
> my ssd 5975.4140625
>
>
> Thanks,
> Arch

Several points:

- The first kernel invocation will be slower than subsequent
invocations because of the time taken to compile the kernel.
- Owing to the relatively low bandwidth of GPU to host memory
transfers, you will probably not see any overall speedup for
relatively short vectors such as those you are processing if you are
loading a new vector into GPU memory at every iteration. You probably
will see better performance processing your vectors in parallel on the
CPU using something like Python's multiprocessing module or dask
distributed (https://distributed.readthedocs.io/en/latest/).
- Since your GPU kernels operate in single precision while numpy.dot is
applied to the original double-precision arrays, you will see small
differences between the CUDA and numpy results. Even at matching
precision, differences in summation order generally prevent the results
from being bit-identical.
-- 
Lev E. Givon, PhD
http://lebedov.github.io


_______________________________________________
PyCUDA mailing list
[email protected]
https://lists.tiker.net/listinfo/pycuda