That efficiency is not too unusual for CUDA devices. My GTX 580 3GB gives this result:
0.854149196146 utilization (1.0 is perfect utilization).
Achieved bandwidth: 156 GB/s
Theoretical maximum bandwidth: 183 GB/s
Fastest kernel execution time: 0.000488095998764
Optimum block shape: (144, 1, 1)

You should check whether ECC is turned on for your C2070 using the nvidia-smi tool. ECC slows down the memory throughput, which might explain the difference between 85% utilization and 72% utilization.

On Feb 16, 2012, at 2:57 PM, Jesse Lu wrote:

> Hi everyone,
>
> I ran a simple experiment today, which consisted of trying to maximize the
> memory (device memory) throughput of a very simple kernel. I was slightly
> disappointed that I was only able to achieve 72% of the theoretical maximum
> bandwidth. My GPU is a C2070. The file is attached and is executed using:
>
> $ python test_pycuda_speed.py
> 0.72196600476 utilization (1.0 is perfect utilization).
> Achieved bandwidth: 98 GB/s
> Theoretical maximum bandwidth: 136 GB/s
> Fastest kernel execution time: 0.000777023971081
> Optimum block shape: (160, 1, 1)
> .
> ----------------------------------------------------------------------
> Ran 1 test in 0.814s
>
> OK
>
> The questions that I have are:
> • How close can others get to the theoretical peak bandwidth?
> • Any suggested tweaks to increase performance?
> Thanks!
>
> Jesse
> <test_pycuda_speed.py>_______________________________________________
> PyCUDA mailing list
> [email protected]
> http://lists.tiker.net/listinfo/pycuda
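For anyone comparing their own numbers, here is a rough sketch of the arithmetic behind the figures in this thread. The attached test_pycuda_speed.py is not reproduced here, so the function names below are hypothetical and the convention 1 GB = 1e9 bytes is an assumption; the script may compute things differently. Utilization is taken to be achieved bandwidth divided by theoretical bandwidth, and achieved bandwidth is taken to be bytes moved divided by kernel time.

```python
def achieved_bandwidth_gb_s(bytes_moved, seconds):
    """Bandwidth in GB/s, assuming 1 GB = 1e9 bytes (an assumption)."""
    return bytes_moved / seconds / 1e9


def utilization(achieved_gb_s, peak_gb_s):
    """Fraction of the theoretical peak bandwidth that was achieved."""
    return achieved_gb_s / peak_gb_s


# Figures quoted in this thread:
# (achieved GB/s, theoretical GB/s, fastest kernel time in seconds)
results = {
    "GTX 580": (156.0, 183.0, 0.000488095998764),
    "C2070": (98.0, 136.0, 0.000777023971081),
}

for name, (achieved, peak, seconds) in results.items():
    # Work backwards to the amount of data the kernel must have moved.
    implied_mb = achieved * 1e9 * seconds / 1e6
    print("%s: utilization %.3f, implied traffic about %.0f MB"
          % (name, utilization(achieved, peak), implied_mb))
```

The ratios computed from the rounded GB/s values differ slightly from the utilization the script prints, presumably because the script divides before rounding. If both cards come out at roughly the same implied traffic, the two runs moved the same amount of data and the utilization figures are directly comparable.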
