Dear PyCUDA users,

I had mistakenly used a slightly different value for one of the parameters
in each case, and after fixing the discrepancy, the two speeds now match.

Sorry for the trouble.

-Kevin

On Fri, May 13, 2011 at 8:20 PM, Kevin Daly <[email protected]> wrote:

> Dear PyCUDA users,
>
> I have been testing the performance of two implementations of the same
> kernel function. One of them launches the kernel from Python using PyCUDA,
> while the other launches it from a standalone C program. The PyCUDA
> implementation is consistently about 20% slower under a range of
> conditions.
>
> My kernel takes an array of length M, performs a calculation N times on
> each element, and sums the N results per element, storing the M sums in an
> output array. The 20% gap persists across many different values of N with
> M held fixed. If the difference were merely a longer initialization time,
> I would expect it to shrink as N increases.
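> To make that concrete: if each run's wall-clock time is modeled as
> t(N) = a + b*N, a least-squares fit separates the fixed overhead a
> (initialization, launch) from the per-iteration cost b. A small sketch —
> the timings below are illustrative made-up numbers, not my actual
> measurements:

```python
import numpy as np

# Hypothetical wall-clock times (seconds) for several values of N,
# one series per implementation. Illustrative numbers only.
N = np.array([1_000, 2_000, 4_000, 8_000, 16_000], dtype=float)
t_pycuda = np.array([0.013, 0.024, 0.046, 0.090, 0.178])
t_c      = np.array([0.011, 0.020, 0.038, 0.074, 0.146])

# Fit t(N) = a + b*N for each series: 'a' is the fixed overhead,
# 'b' is the per-iteration cost. polyfit returns [b, a].
b_py, a_py = np.polyfit(N, t_pycuda, 1)
b_c,  a_c  = np.polyfit(N, t_c, 1)

print(f"PyCUDA: fixed {a_py*1e3:.2f} ms, per-iteration {b_py*1e6:.3f} us")
print(f"C:      fixed {a_c*1e3:.2f} ms, per-iteration {b_c*1e6:.3f} us")

# A roughly constant gap in the per-iteration cost b, rather than in the
# fixed term a, would rule out initialization overhead as the cause.
print(f"per-iteration ratio: {b_py / b_c:.2f}")
```

> With these made-up numbers the fixed terms match and the per-iteration
> ratio stays near 1.2, which is the pattern I am seeing.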
>
> This is how I am launching the kernel using PyCUDA:
>
> with open(cu_file_path) as cube_file:
>     module = pycuda.compiler.SourceModule(cube_file.read(), no_extern_c=True)
> kernel_func = module.get_function("my_kernel")
> kernel_func(drv.In(inp_array), numpy.int32(arg2), numpy.float32(arg3), ...,
> drv.Out(outp_array))
>
>
> This is how I compile the C implementation:
>
> nvcc -ccbin /usr/bin -I. -I/usr/local/cuda/include -Xptxas -v -arch sm_20
> -c test_kernel.cu -o test_kernel.cu.o
> g++ -fPIC -o test_kernel test_kernel.cu.o -L/usr/local/cuda/lib64 -lcudart
>
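> Possibly relevant: SourceModule accepts nvcc options, so the PyCUDA build
> can be given the same flags as the standalone nvcc invocation above, making
> both runs execute identically compiled machine code. A sketch of how I
> would try it (cu_file_path and my_kernel as in my snippet above; whether
> the flags matter here is just a guess):

```python
import pycuda.autoinit          # creates a CUDA context on the default device
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Compile with the same flags as the standalone nvcc command,
# so any codegen difference between the two builds is eliminated.
with open(cu_file_path) as f:   # cu_file_path as in the snippet above
    module = SourceModule(
        f.read(),
        no_extern_c=True,
        arch="sm_20",               # matches -arch sm_20
        options=["-Xptxas", "-v"],  # matches -Xptxas -v
    )
kernel_func = module.get_function("my_kernel")
```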
> In both cases I am launching the kernel with the same number of threads per
> block and blocks per grid.
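> For anyone reproducing the comparison: the usual way to guarantee identical
> launch configurations is to derive blocks-per-grid from M and the block
> size with a ceiling division, and feed the same numbers to both the PyCUDA
> call (block=/grid= keywords) and the C launch (<<<grid, block>>>). A small
> sketch — the block size of 256 and M below are made-up example values:

```python
# Round up so every one of the M elements is covered by some thread;
# the kernel itself should still bounds-check its global index.
def launch_config(m, threads_per_block=256):
    blocks_per_grid = (m + threads_per_block - 1) // threads_per_block
    return (threads_per_block, 1, 1), (blocks_per_grid, 1)

block, grid = launch_config(1_000_000)
print(block, grid)  # e.g. (256, 1, 1) and (3907, 1)
```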
>
> Is this the best way of compiling/launching the kernel from PyCUDA?
>
>
> -Kevin
>
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
