I agree that data size matters in these discussions. But I think the right way to account for it is to show performance at a range of data sizes, as measured from Python.
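Concretely, that kind of measurement might look like the sketch below: time the whole call from the Python side, across a sweep of sizes, so any fixed per-call overhead is included in every number. NumPy's `max` stands in for the GPU reduction here; in an actual PyCUDA benchmark the timed line would instead be something like `pycuda.gpuarray.max(d_a).get()`, which includes the copy-back to the host.

```python
# Sketch: time a max-reduction from the Python side across a range of data
# sizes, so that fixed per-call overheads show up in every measurement.
# NumPy is used as a stand-in so the sketch runs anywhere; swap the body of
# time_reduction for the GPU call you actually want to measure.
import time
import numpy as np

def time_reduction(a, repeats=5):
    """Best-of-`repeats` wall-clock time (seconds) for one max-reduction."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        _ = a.max()  # in the GPU case: kernel(s) + result copy-back
        best = min(best, time.perf_counter() - t0)
    return best

sizes = [10**k for k in range(2, 7)]  # 1e2 .. 1e6 elements
timings = {n: time_reduction(np.random.rand(n)) for n in sizes}
for n, t in timings.items():
    print(f"n={n:>8}  {t * 1e6:10.1f} us")
```

Plotting (or just eyeballing) the resulting curve makes both positions visible at once: the flat floor at small sizes is the overhead, and its shrinking share at large sizes is the amortization.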
The assumption that you'll keep the GPU busy isn't necessarily true. thrust::reduce, for example (which max_element uses internally), launches a big kernel, followed by a small kernel to finish the reduction tree, followed by a cudaMemcpy to transfer the result back to the host. The GPU won't be busy during the small kernel, nor during the cudaMemcpy, nor during the conversion back to Python, etc.

Reduce is often used to make control flow decisions in optimization loops, where you don't know what the next optimization step will be until the result is known, so you can't launch the work speculatively. If the control flow is performed in Python, all these overheads are exposed to application performance, so I think they matter. The fact that they're relatively less important for larger problems will be evident if the timings are made from the Python side.

- bryan

On Wed, May 30, 2012 at 10:20 PM, Andreas Kloeckner <[email protected]> wrote:
> On Wed, 30 May 2012 21:58:13 -0700, Bryan Catanzaro <[email protected]> wrote:
>> Why should the overhead be measured separately? For users of these
>> systems, the Python overhead is unavoidable. The time spent running
>> on the GPU alone is an important implementation detail for people
>> improving systems like PyCUDA, but users of these systems see overhead
>> costs exposed in their overall application performance, and so I don't
>> see how the overhead can be ignored.
>
> Because whether the overhead matters or not depends on data size. Since
> the overhead is constant across all data sizes, that overhead is going
> to be mostly irrelevant for big data, whereas for tiny data it might
> well be a dealbreaker.
>
> That's why I think a single number doesn't cut it.
>
> In addition, there's an underlying assumption that you'll keep the GPU
> busy for a while, i.e. keep the GPU queue saturated.
> If you do that (the
> ability to do that being related, again, to data size), then on top of
> that anything Python does runs in parallel to the GPU--and your net run
> time will be exactly the same as if the overhead never happened.
>
> Andreas

_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
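A postscript on the data-size argument in the quoted message: the dilution of a fixed per-call cost is easy to quantify with a back-of-envelope model. The 20 us overhead and 100 GB/s effective bandwidth below are illustrative assumptions, not measurements of any particular system.

```python
# Back-of-envelope model: how a fixed per-call overhead (kernel launches,
# cudaMemcpy of the result, Python dispatch) dilutes with data size.
# OVERHEAD_S and BANDWIDTH_B_PER_S are illustrative assumptions.
OVERHEAD_S = 20e-6            # fixed cost per reduction call, seconds
BANDWIDTH_B_PER_S = 100e9     # assumed effective device memory bandwidth
BYTES_PER_ELEMENT = 4         # float32

def overhead_fraction(n):
    """Share of total call time spent in fixed overhead for n elements."""
    kernel = n * BYTES_PER_ELEMENT / BANDWIDTH_B_PER_S
    return OVERHEAD_S / (OVERHEAD_S + kernel)

for n in (10**3, 10**6, 10**9):
    print(f"n={n:>10}: overhead is {overhead_fraction(n):6.1%} of the call")
```

Under these assumptions the overhead is essentially the whole call at a thousand elements and noise at a billion, which is exactly why a single number at a single size settles nothing: both "it's a dealbreaker" and "it's irrelevant" are true, at different sizes.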
