On 06/29/2016 07:11 AM, Thomas Schwinge wrote:
> Cesar, I have not yet fully digested this, but do I understand right that
> you're really fixing two issues here, that are related (OpenACC routines)
> but still can be addressed independently of each other? Do I understand
> right that the first one, the "problems with acc routines [...]
> incorrectly permitting 'acc seq' loops to call gang, worker and vector
> routines" is just a Fortran front end patch? If yes, please split that
> one out, so as to reduce the volume of remaining changes that remain to
> be discussed.
This patch addresses the following issues:
1. Issues warnings when a non-acc routine function is called inside an
OpenACC offloaded region.
2. It corrects a bug what was allowing seq loops to call gang, worker
and vector routines.
3. It adds supports for acc routines in fortran modules (which I
noticed was missing when I added 'acc routine seq' to acc_on_device
in the fortran openacc include files).
I'll split these into separate patches.
> On Thu, 23 Jun 2016 09:05:38 -0700, Cesar Philippidis
> <[email protected]> wrote:
>> On 06/17/2016 07:42 AM, Jakub Jelinek wrote:
>>> On Wed, Jun 15, 2016 at 08:12:15PM -0700, Cesar Philippidis wrote:
>>>> The second set of changes involves teaching the gimplifier to error when
>>>> it detects a function call to an non-acc routines inside an OpenACC
>>>> offloaded region.
>
> As I understand, that's the same problem as has been discussed before
> (Ilya CCed), and has recently again been filed in
> <https://gcc.gnu.org/PR71499> "ICE in LTO1 when attempting NVPTX
> offloading (-fopenacc)", and <https://gcc.gnu.org/PR71535> "ICE in LTO1
> with -fopenmp offloading" (Alexander CCed). Some earlier discussion
> threads include:
> <http://news.gmane.org/find-root.php?message_id=%3C20150109145702.GA45210%40msticlxl57.ims.intel.com%3E>,
> <http://news.gmane.org/find-root.php?message_id=%3C20150724152119.GA41292%40msticlxl57.ims.intel.com%3E>,
> <http://news.gmane.org/find-root.php?message_id=%3C56269C05.6030502%40acm.org%3E>.
>
>>>> Actually, I relaxed non-acc routines by excluding
>>>> calls to builtin functions, including those prefixed with _gfortran_.
>>>> Nvptx does have a newlib c library, and it also has a subset of
>>>> libgfortran. Still, this solution is probably not optimal.
>>>
>>> I don't really like that, hardcoding prefixes or whatever is available
>>> (you have quite some subset of libc, libm etc. available too) in the
>>> compiler looks very hackish. What is wrong with complaining during
>>> linking of the offloaded code?
>
> ACK. Jakub, do I understand you correctly, that you basically say that
> every function declaration that is in scope inside offloaded regions (for
> example, GCC builtin functions, or standard library functions declared in
> target compiler's header files) is permitted to be called in offloaded
> regions, and the offloading compiler will then either be able to resolve
> these (nvptx back end knows about trigonometric functions, for example,
> and a lot of functions are available in the nvptx libc), or otherwise
> error out during the offloading compilation (during linking), gracefully
> without terminating the target compilation (that "gracefully" bit is
> currently missing -- that's for another day). That is, all such
> functions are implicitly callable as OpenACC "seq" functions (which means
> that they don't internally use gang/worker/vector parallelism). In
> particular, all these functions do *not* need to be marked with an
> explicit "#pragma acc routine seq" directive. (Functions internally
> using gang/worker/vector parallelism will need to be marked
> appropriately, using a "#pragma acc routine gang/worker/vector"
> directive.) That's how I understand your comment above, and your earlier
> comments on this topic, and also is what I think should be done.
OK. I'll drop the warning changes from my patch set then unless you want
to keep it.
> A few random comments on the patch:
>
>> --- a/gcc/fortran/gfortran.h
>> +++ b/gcc/fortran/gfortran.h
>> @@ -303,6 +303,15 @@ enum save_state
>> { SAVE_NONE = 0, SAVE_EXPLICIT, SAVE_IMPLICIT
>> };
>>
>> +/* Flags to keep track of ACC routine states. */
>> +enum oacc_function
>> +{ OACC_FUNCTION_NONE = 0,
>> + OACC_FUNCTION_SEQ,
>> + OACC_FUNCTION_GANG,
>> + OACC_FUNCTION_WORKER,
>> + OACC_FUNCTION_VECTOR
>> +};
>
> What's the purpose of OACC_FUNCTION_NONE? It's not used anywhere, as far
> as I can tell?
It's used by the fortran module code. It controls how parallelism gets
encoded in the .mod files.
>> --- a/gcc/fortran/openmp.c
>> +++ b/gcc/fortran/openmp.c
>> @@ -1664,21 +1664,31 @@ gfc_match_oacc_cache (void)
>>
>> /* Determine the loop level for a routine. */
>>
>> -static int
>> +static oacc_function
>> gfc_oacc_routine_dims (gfc_omp_clauses *clauses)
>> {
>> int level = -1;
>> + oacc_function ret = OACC_FUNCTION_SEQ;
>>
>> if (clauses)
>> {
>> unsigned mask = 0;
>>
>> if (clauses->gang)
>> - level = GOMP_DIM_GANG, mask |= GOMP_DIM_MASK (level);
>> + {
>> + level = GOMP_DIM_GANG, mask |= GOMP_DIM_MASK (level);
>> + ret = OACC_FUNCTION_GANG;
>> + }
>> if (clauses->worker)
>> - level = GOMP_DIM_WORKER, mask |= GOMP_DIM_MASK (level);
>> + {
>> + level = GOMP_DIM_WORKER, mask |= GOMP_DIM_MASK (level);
>> + ret = OACC_FUNCTION_WORKER;
>> + }
>> if (clauses->vector)
>> - level = GOMP_DIM_VECTOR, mask |= GOMP_DIM_MASK (level);
>> + {
>> + level = GOMP_DIM_VECTOR, mask |= GOMP_DIM_MASK (level);
>> + ret = OACC_FUNCTION_VECTOR;
>> + }
>> if (clauses->seq)
>> level = GOMP_DIM_MAX, mask |= GOMP_DIM_MASK (level);
>>
>
> I have not looked in detail, so maybe I'm misunderstanding what is being
> done here -- but how do "clauses->seq" and "level = GOMP_DIM_MAX" fit
> together?
Conceptually, if you take a look at the oacc_function attribute in a
tree dump, you'll see an array with three elements. Basically, each
element in that array represents a gang, worker or vector parallelism.
By definition, a gang loop permits a worker and vector loop to be nested
inside it. So, for a gang routine, the oacc_function attribute is
constructed such that it permits gang, worker and vector level
parallelism. Similarly, for a worker routine, the oacc_function
attribute has the worker and vector level parallelism 'bits' set.
With that in mind, setting seq to GOMP_DIM_MASK allows the loop creating
that oacc_function attribute to mask out any gang, worker and vector
parallelism.
>> @@ -1689,7 +1699,7 @@ gfc_oacc_routine_dims (gfc_omp_clauses *clauses)
>> if (level < 0)
>> level = GOMP_DIM_MAX;
>>
>> - return level;
>> + return ret;
>> }
>
> Just from that last hunk, it seems that the assignment to "level" is a
> dead store?
I'll need to check this when I split out the patch.
>> +static tree
>> +add_attributes_to_decl (symbol_attribute sym_attr, tree list)
>> +{
>> + unsigned id;
>> + tree attr;
>> +
>> + for (id = 0; id < EXT_ATTR_NUM; id++)
>> + if (sym_attr.ext_attr & (1 << id))
>> + {
>> + attr = build_tree_list (
>> + get_identifier (ext_attr_list[id].middle_end_name),
>> + NULL_TREE);
>> + list = chainon (list, attr);
>> + }
>> +
>> + list = add_omp_offloading_attributes (sym_attr.omp_declare_target,
>> + sym_attr.oacc_function, list);
>> +
>> + return list;
>> +}
>
> Something that I had noticed before, possibly related here: code in
> gcc/fortran/ does never call replace_oacc_fn_attrib, but the C and C++
> front ends do. Is that function what you've re-implemented here?
Similar, but I broke this code out from another function to handle
BUILT_IN_EXPECT. But I can revert this change now, since BUILT_IN_EXPECT
will be treated as an implicit SEQ routine.
>> --- a/gcc/lto-cgraph.c
>> +++ b/gcc/lto-cgraph.c
>> @@ -1201,9 +1201,11 @@ input_overwrite_node (struct lto_file_decl_data
>> *file_data,
>> LDPR_NUM_KNOWN);
>> node->instrumentation_clone = bp_unpack_value (bp, 1);
>> node->split_part = bp_unpack_value (bp, 1);
>> - gcc_assert (flag_ltrans
>> - || (!node->in_other_partition
>> - && !node->used_from_other_partition));
>> +
>> + int success = flag_ltrans || (!node->in_other_partition
>> + && !node->used_from_other_partition);
>> + if (!success)
>> + error ("Missing %<%s%>", node->name ());
>> }
>>
>> /* Return string alias is alias of. */
>> @@ -1416,9 +1418,11 @@ input_varpool_node (struct lto_file_decl_data
>> *file_data,
>> node->set_section_for_node (section);
>> node->resolution = streamer_read_enum (ib, ld_plugin_symbol_resolution,
>> LDPR_NUM_KNOWN);
>> - gcc_assert (flag_ltrans
>> - || (!node->in_other_partition
>> - && !node->used_from_other_partition));
>> +
>> + int success = flag_ltrans || (!node->in_other_partition
>> + && !node->used_from_other_partition);
>> + if (!success)
>> + error ("Missing %<%s%>", node->name ());
>>
>> return node;
>> }
>
> That looks similar to what I remember from earlier, simiar patches, as
> referenced above.
It is. I never got around to pushing that patch very strongly because I
thought those link failures were legitimate compiler bugs.
>> @@ -19420,7 +19442,8 @@ oacc_loop_fixed_partitions (oacc_loop *loop,
>> unsigned outer_mask)
>> {
>> unsigned outermost = this_mask & -this_mask;
>>
>> - if (outermost && outermost <= outer_mask)
>> + if ((outermost && outermost <= outer_mask)
>> + || (this_mask && (loop->parent->flags & OLF_SEQ)))
>> {
>> if (noisy)
>> {
>
>> --- a/gcc/testsuite/c-c++-common/goacc/routine-3.c
>> +++ b/gcc/testsuite/c-c++-common/goacc/routine-3.c
>> @@ -49,7 +49,7 @@ main ()
>> int red = 0;
>> #pragma acc parallel copy (red)
>> {
>> - /* Independent/seq loop tests. */
>> + /* Independent loop tests. */
>> #pragma acc loop reduction (+:red) // { dg-warning "insufficient
>> partitioning" }
>> for (int i = 0; i < 10; i++)
>> red += gang ();
>> @@ -62,6 +62,19 @@ main ()
>> for (int i = 0; i < 10; i++)
>> red += vector ();
>>
>> + /* Seq loop tests. */
>> +#pragma acc loop seq reduction (+:red) /* { dg-message "containing loop" }
>> */
>> + for (int i = 0; i < 10; i++)
>> + red += gang (); /* { dg-error "incorrectly nested" } */
>> +
>> +#pragma acc loop seq reduction (+:red) /* { dg-message "containing loop" }
>> */
>> + for (int i = 0; i < 10; i++)
>> + red += worker (); /* { dg-error "incorrectly nested" } */
>> +
>> +#pragma acc loop seq reduction (+:red) /* { dg-message "containing loop" }
>> */
>> + for (int i = 0; i < 10; i++)
>> + red += vector (); /* { dg-error "incorrectly nested" } */
>> +
>> /* Gang routine tests. */
>> #pragma acc loop gang reduction (+:red) /* { dg-message "containing loop"
>> } */
>> for (int i = 0; i < 10; i++)
>
> Do these test case changes actually relate to any of the compiler changes
> discussed above? Maybe to the oacc_loop_fixed_partitions cited just
> above? Is that a separate issue to fix? Eh, or is that actually the fix
> for your first issue, the "problems with acc routines [...] incorrectly
> permitting 'acc seq' loops to call gang, worker and vector routines"?
This is issue 2, and I'll break it out into a separate patch.
>> ---
>> a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-kernels-ipa-pta.c
>> +++
>> b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-kernels-ipa-pta.c
>> @@ -1,4 +1,4 @@
>> /* { dg-do run { target lto } } */
>> -/* { dg-additional-options "-fipa-pta -flto -flto-partition=max" } */
>> +/* { dg-additional-options "-fipa-pta -flto -flto-partition=max
>> -fno-exceptions" } */
>>
>> #include "data-clauses-kernels.c"
>
>> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-kernels.c
>> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-kernels.c
>> @@ -1,2 +1,4 @@
>> +/* { dg-additional-options "-fno-exceptions" } */
>> +
>> #define CONSTRUCT kernels
>> #include "data-clauses.h"
>
>> ---
>> a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-parallel-ipa-pta.c
>> +++
>> b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-parallel-ipa-pta.c
>> @@ -1,4 +1,4 @@
>> /* { dg-do run { target lto } } */
>> -/* { dg-additional-options "-fipa-pta -flto -flto-partition=max" } */
>> +/* { dg-additional-options "-fipa-pta -flto -flto-partition=max
>> -fno-exceptions" } */
>>
>> #include "data-clauses-parallel.c"
>
>> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-parallel.c
>> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-parallel.c
>> @@ -1,2 +1,4 @@
>> +/* { dg-additional-options "-fno-exceptions" } */
>> +
>> #define CONSTRUCT parallel
>> #include "data-clauses.h"
>
> Hmm?
I'm not sure what happened here either. Maybe adding the 'acc routine'
directive to acc_on_device is preventing that function from expanding to
its builtin function counterpart, which caused gcc to generate exception
code?
Cesar