First of all, I made a typo in my sample program. The value of 100000 should be 
169. That makes those array declarations less problematic, I think.
 
#define lenP 6
#define nPoints 169
...
 
__device__ void someFunction()
{
 
float residu[nPoints], newResidu[nPoints], pNew[lenP], b[lenP], deltaP[lenP];
float A[lenP*lenP], Jacobian[nPoints*lenP], B[lenP*lenP];
...
 
}
 
Unfortunately, the code section that I mentioned is quite large and I am not 
allowed to make it public.
I can say, though, that it consists of calculations on the above-mentioned 
arrays.
 
I have not been able to make a simple program that reproduces this effect yet, 
but I will have another look.
But still, pyCuda uses the same compiler as nvcc, right?
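(One way to check that both paths really compile the code the same way, sketched below with placeholder file names: ask ptxas to report per-kernel resource usage in both builds and compare the figures. The -Xptxas -v flag is a standard nvcc option, and PyCUDA's SourceModule accepts extra nvcc options.)

```shell
# Plain nvcc build: have ptxas report registers and local memory per kernel.
nvcc -Xptxas -v kernel.cu -o kernel

# PyCUDA side: pass the same flag through SourceModule, e.g. in Python:
#   SourceModule(source, options=["-Xptxas", "-v"])
# then compare the "registers" and "lmem" numbers in the compiler output.
```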
 
Michiel.
 


>>> Bogdan Opanchuk <[email protected]> 4/4/2012 12:55 PM >>>
Hello Michiel,

On Wed, Apr 4, 2012 at 8:39 PM, Michiel Bruinink
<[email protected]> wrote:
> I don't think streams will do any good, because I have seen that the memcpy
> time is a small part of the total time and it is the same for nvcc and
> pyCuda.

Streams can be used for kernels too, not only for operations with
memory. But I agree, from your explanations it seems that streams are
not the issue here.

> The larger pyCuda execution time is pure calculation time.
> In fact, when I comment out a section of the device code, the nvcc and
> pyCuda times are almost equal.

This sounds interesting, could you possibly quote this section here?
Or, even better, construct two simple programs, in Python and in C,
which reproduce this effect?
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda