Hi!

There are two issues here: 1. the "avoid offloading" mechanism, and
2. the "avoid offloading" policy.

On Wed, 10 Feb 2016 21:07:29 +0100, Bernd Schmidt <bschm...@redhat.com> wrote:
> On 02/10/2016 06:37 PM, Thomas Schwinge wrote:
> > On Wed, 10 Feb 2016 17:37:30 +0100, Bernd Schmidt <bschm...@redhat.com> 
> > wrote:
> >> IIUC it's also disabling offloading for parallels rather than just
> >> kernels, which we previously said shouldn't happen.
> >
> > Ah, you're talking about mixed OpenACC parallel/kernels codes -- I
> > understood the earlier discussion to apply to parallel-only codes, where
> > the "avoid offloading" flag will never be set.  In mixed parallel/kernels
> > code with one un-parallelized kernels construct, offloading would also
> > (have to be) disabled for the parallel constructs (for the same data
> > consistency reasons explained before).

The "avoid offloading" mechanism.  Owed to the non-shared-memory
offloading architecture, if the compiler/runtime decides to "avoid
offloading", then this has to apply to *all* code offloading, for data
consistency reasons.  Do we agree on that?
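
To make that concrete, here is a minimal sketch (a made-up example, not
taken from any test case) of mixed parallel/kernels code.  With a
non-shared-memory device, if only the un-parallelized kernels construct
fell back to host execution while the parallel construct still ran on
the device, the two would operate on different copies of "a" inside the
data region:

    /* Hypothetical example; "a" stays resident on the device for the
       whole data region.  */
    void
    foo (float *a, int n)
    {
    #pragma acc data copy(a[0:n])
      {
    #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
          a[i] = 2 * a[i];

        /* Loop-carried dependence; parloops cannot parallelize this, so
           it would execute single-threaded on the device.  */
    #pragma acc kernels
        for (int i = 1; i < n; ++i)
          a[i] = a[i] + a[i - 1];
      }
    }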

> > The majority of codes I've seen
> > use either parallel or kernels constructs, typically not both.
> 
> That's not something I'd want to hard-code into the compiler however. 
> Don't know how Jakub feels but to me this approach is way too 
> coarse-grained.

The "avoid offloading" policy.  I'm looking into improving that.


> > Huh?  Like, at random, discouraging users from using GCC's SIMD
> > vectorizer just because that one fails to vectorize some code that it
> > could/should vectorize?  (Of course, I'm well aware that GCC's SIMD
> > vectorizer is much more mature than the OpenACC kernels/parloops
> > handling; it's seen many more years of development.)
> 
> Your description sounded like it's not actually not optimizing, but 
> actively hurting performance for a large selection of real world codes. 

Indeed, offloading execution of single-threaded code (that is, an
un-parallelized OpenACC kernels construct) hurts performance: there is
the data copy overhead and the kernel launch overhead, and, compared to
a single CPU core, a single GPU core has higher memory access latencies
and is slower -- hence the idea to resort to host-fallback execution in
such a situation.
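
As a point of reference, a user can already force exactly that behavior
through the standard OpenACC runtime API.  The following is just a
sketch: acc_set_device_type and acc_device_host come from <openacc.h>,
the surrounding function is made up:

    #include <openacc.h>

    void
    use_host_fallback (void)
    {
      /* Execute all following compute constructs on the host: no data
         copies, no kernel launches, and un-parallelized loops run on a
         CPU core instead of a single GPU core.  */
      acc_set_device_type (acc_device_host);
    }

The "avoid offloading" mechanism would in effect apply that decision
automatically when the compiler/runtime considers offloading not
worthwhile, instead of requiring the user to request it.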

> If I understood that correctly, we need to document this in the manual.

OK; prototyping that on <https://gcc.gnu.org/wiki/OpenACC>.


Regards
 Thomas
