On Thu, Mar 5, 2015 at 5:18 AM, Massimiliano Fatica <mfat...@gmail.com> wrote:
> I would not draw too many conclusions, the SpecAcc is just telling you the
> quality of the OpenACC compiler and the quality of the porting.
> For example, if you look at the results for CloverLeaf (I am familiar with
> this application and have other reference points), you have:
> AMD/Pathscale: 3.13 specaccel_peak
> NVIDIA/PGI: 3.45 specaccel_peak

To state it again - our compiler is not perfect. There are a couple of things
blocking us from hitting 4+ in certain benchmarks.

>
> Keeping the HW constant and changing the software (adding CUDA C and CUDA
> Fortran to the mix) will give you
> for the 3840x3840 grid the following average times per cell (measured in
> 10^-8s):
> OpenACC loops: 1.92
> OpenACC kernels: 1.78
> CUDA Fortran: 1.33
> CUDA C: 1.25

I would not compare PGI OpenACC to CUDA and conclude that OpenACC is bound
to lose. If we beat PGI OpenACC by 30%, that difference starts to narrow
quickly.

>
> Timing is on a K20c, but we are interested in the relative performance. CUDA
> C/Fortran is 30% faster.
> There is also an OpenCL implementation of CloverLeaf but I don't have the
> results. It is probably in the same ballpark.
> This is a "simple" CFD code with regular access pattern, a directive-based
> porting gives you decent results.
> You could try to run the OpenCL code on the AMD card and see how far the
> Pathscale compiler is from it, but I am
> expecting something similar.
>
> OpenACC is an interesting option for people looking for high level
> programming, but you usually pay a penalty.
> How big the penalty is will depend on a lot of factors and it is very
> difficult to generalize.

I think with poorly written CUDA or poorly written OpenACC you'll pay a
penalty in both cases. I think with good OpenACC and a good compiler (after
we fix some bugs), that perceived penalty will start to narrow. (Yes, highly
tuned CUDA will probably always win, but by how much?)

The thing to keep in mind is that our compiler, unlike every other
implementation, does not do any source-to-source translation or dump
byte-code.

1) Our code generator targets bare metal instructions
2) It's optimized for HPC - not just a recycled shader compiler

Our GPU transformations *know* the hardware and how to map the right grid
sizes to the resources underneath.
When that mapping is done correctly and in combination with good old code
generation == win.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
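
For reference, the relative figures discussed in the thread can be checked with a few lines of arithmetic. The sketch below uses only the per-cell times quoted above (K20c, 3840x3840 grid, units of 10^-8 s). The helper names are made up for illustration, and treating "30% faster" as "30% less time per cell" is an assumption; the thread does not say which definition is meant, so the throughput reading is shown alongside it.

```python
# Per-cell times quoted in the thread (units of 10^-8 s, lower is better).
times = {
    "OpenACC loops": 1.92,
    "OpenACC kernels": 1.78,
    "CUDA Fortran": 1.33,
    "CUDA C": 1.25,
}


def time_reduction(slow, fast):
    """Fraction of run time saved when moving from `slow` to `fast`."""
    return (slow - fast) / slow


def slowdown(slow, fast):
    """Extra time `slow` needs, expressed as a fraction of `fast`."""
    return slow / fast - 1.0


baseline = times["CUDA C"]
for name, t in times.items():
    print(f"{name:16s} {t:.2f}  "
          f"+{slowdown(t, baseline):.1%} time vs CUDA C, "
          f"CUDA C saves {time_reduction(t, baseline):.1%}")

# Hypothetical scenario from the reply: an OpenACC compiler that beats
# the quoted OpenACC-kernels time by 30%.
# Reading A (assumption): 30% less time per cell.
hypothetical_less_time = times["OpenACC kernels"] * (1 - 0.30)
# Reading B (assumption): 1.3x the throughput.
hypothetical_throughput = times["OpenACC kernels"] / 1.30

print(f"30% less time:   {hypothetical_less_time:.3f}")
print(f"1.3x throughput: {hypothetical_throughput:.3f}")
```

Under reading A the hypothetical time comes out at about 1.25, level with the quoted CUDA C figure. Under reading B it is about 1.37, close to the quoted CUDA Fortran figure. Either way, a 30% improvement over the quoted OpenACC-kernels time would close most of the gap shown in these particular measurements.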