Dear PyCUDA users,
I have been comparing the performance of two implementations of the same
kernel function. One launches the kernel from Python using PyCUDA, while
the other launches it from a C program. The PyCUDA implementation is
consistently about 20% slower under a range of different conditions.
My kernel takes an array of length M, performs a calculation N times on
each element, and sums the N results per element, storing the M sums in an
output array. The 20% difference in speed persists across many different
values of N with M held fixed. If the difference merely corresponded to a
longer one-time initialization cost, I would expect it to shrink as N
increases, since the kernel's own work grows with N.
This is how I am launching the kernel using PyCUDA:
import numpy
import pycuda.autoinit  # initializes the device and creates a context
import pycuda.compiler
import pycuda.driver as drv

with open(cu_file_path) as cube_file:
    module = pycuda.compiler.SourceModule(cube_file.read(), no_extern_c=True)
kernel_func = module.get_function("my_kernel")
# drv.In/drv.Out copy the arrays to/from the device around each launch
kernel_func(drv.In(inp_array), numpy.int32(arg2), numpy.float32(arg3), ...,
            drv.Out(outp_array))
This is how I compile the C implementation:
nvcc -ccbin /usr/bin -I. -I/usr/local/cuda/include -Xptxas -v -arch sm_20 -c
test_kernel.cu -o test_kernel.cu.o
g++ -fPIC -o test_kernel test_kernel.cu.o -L/usr/local/cuda/lib64 -lcudart
In both cases I launch the kernel with the same number of threads per
block and blocks per grid.
Is this the best way of compiling/launching the kernel from PyCUDA?
-Kevin
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda