Hi!  There are two issues here: 1. the "avoid offloading" mechanism, and 2. the "avoid offloading" policy.
On Wed, 10 Feb 2016 21:07:29 +0100, Bernd Schmidt <bschm...@redhat.com> wrote:
> On 02/10/2016 06:37 PM, Thomas Schwinge wrote:
> > On Wed, 10 Feb 2016 17:37:30 +0100, Bernd Schmidt <bschm...@redhat.com> wrote:
> >> IIUC it's also disabling offloading for parallels rather than just
> >> kernels, which we previously said shouldn't happen.
> >
> > Ah, you're talking about mixed OpenACC parallel/kernels codes -- I
> > understood the earlier discussion to apply to parallel-only codes, where
> > the "avoid offloading" flag will never be set.  In mixed parallel/kernels
> > code with one un-parallelized kernels construct, offloading would also
> > (have to be) disabled for the parallel constructs (for the same data
> > consistency reasons explained before).

That's the "avoid offloading" mechanism: owing to the non-shared-memory offloading architecture, if the compiler/runtime decides to "avoid offloading", then this has to apply to *all* offloaded code, for data-consistency reasons.  Do we agree on that?

> > The majority of codes I've seen use either parallel or kernels
> > constructs, typically not both.
>
> That's not something I'd want to hard-code into the compiler however.
> Don't know how Jakub feels but to me this approach is way too
> coarse-grained.

That's the "avoid offloading" policy.  I'm looking into improving that.

> > Huh?  Like, at random, discouraging users from using GCC's SIMD
> > vectorizer just because that one fails to vectorize some code that it
> > could/should vectorize?  (Of course, I'm well aware that GCC's SIMD
> > vectorizer is much more mature than the OpenACC kernels/parloops
> > handling; it's seen many more years of development.)
>
> Your description sounded like it's not actually not optimizing, but
> actively hurting performance for a large selection of real world codes.
Indeed single-threaded (that is, un-parallelized OpenACC kernels construct) offloading execution is hurting performance (data copy overhead; kernel launch overhead; compared to a single CPU core, a single GPU core has higher memory access latencies and is slower) -- hence the idea to resort to host-fallback execution in such a situation.

> If I understood that correctly, we need to document this in the manual.

OK; prototyping that on <https://gcc.gnu.org/wiki/OpenACC>.

Grüße
 Thomas