Dear PyCUDA users,

It turns out I had mistakenly used a slightly different value for one of the parameters in each implementation, and after fixing the discrepancy, the two speeds now match.
Sorry for the trouble.

-Kevin

On Fri, May 13, 2011 at 8:20 PM, Kevin Daly <[email protected]> wrote:
> Dear PyCUDA users,
>
> I have been testing the performance of two implementations of the same
> kernel function. One launches the kernel from Python using PyCUDA,
> while the other launches it from a C program. The PyCUDA implementation
> appears to be systematically slower by about 20% under a range of
> different conditions.
>
> My kernel takes an array of length M, performs a calculation N times on
> each element, and sums the N results for each element. It then stores
> the M sums in an output array. The 20% speed difference persists across
> many different values of N, holding M fixed. If the difference merely
> reflected a longer initialization time, I would expect it to shrink as
> N increases.
>
> This is how I launch the kernel using PyCUDA:
>
> cube_file = open(cu_file_path)
> module = pycuda.compiler.SourceModule(cube_file.read(), no_extern_c=True)
> cube_file.close()
> kernel_func = module.get_function("my_kernel")
> kernel_func(drv.In(inp_array), numpy.int32(arg2), numpy.float32(arg3), ...,
>             drv.Out(outp_array))
>
> This is how I compile the C implementation:
>
> nvcc -ccbin /usr/bin -I. -I/usr/local/cuda/include -Xptxas -v -arch sm_20 \
>     -c test_kernel.cu -o test_kernel.cu.o
> g++ -fPIC -o test_kernel test_kernel.cu.o -L/usr/local/cuda/lib64 -lcudart
>
> In both cases I launch the kernel with the same number of threads per
> block and blocks per grid.
>
> Is this the best way of compiling/launching the kernel from PyCUDA?
>
> -Kevin
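Since the culprit here turned out to be a single differing parameter value, a cheap sanity check before benchmarking is to assert that both implementations are being launched with identical parameters. A minimal pure-Python sketch of that idea (the parameter names and values below are hypothetical, not taken from the actual kernel):

```python
# Sanity-check that two implementations of the same benchmark are
# configured identically before comparing their timings.
# All parameter names and values here are hypothetical illustrations.

def assert_same_launch_params(params_a, params_b):
    """Raise AssertionError naming every parameter that differs."""
    mismatches = [
        key for key in sorted(set(params_a) | set(params_b))
        if params_a.get(key) != params_b.get(key)
    ]
    assert not mismatches, "parameter mismatch: %s" % ", ".join(mismatches)

# Example: compare the PyCUDA launch config against the C launch config.
pycuda_params = {"M": 1 << 20, "N": 1000, "block": (256, 1, 1), "grid": (4096, 1)}
c_params      = {"M": 1 << 20, "N": 1000, "block": (256, 1, 1), "grid": (4096, 1)}

assert_same_launch_params(pycuda_params, c_params)  # passes: configs agree
```

One could, for instance, have the C program print its parameters and feed them into such a check from the Python side, so a silent discrepancy like this one fails loudly instead of showing up as a mysterious 20% gap.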
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
