Re: [PATCH, nvptx] Expand OpenACC child function arguments to use CUDA params space

Chung-Lin Tang Tue, 24 Sep 2019 03:44:23 -0700

Hi Thomas, thanks for the review.

On 2019/9/20 12:28 AM, Thomas Schwinge wrote:

This new implementation works by modifying the GIMPLE for child functions
directly at the very start (before, actually) of RTL expansion

That's now near the other end of the pipeline.;-)  What's the motivation
for putting it there, instead of early in the nvptx offloading
compilation (around 'pass_oacc_device_lower' etc. time, where I would've
assumed this transformation to be done)?  Not asking you to change that
now, but curious for the reason.


I am not sure we have a natural boundary that defines/marks the start of the
offload compiler stages. Maybe if we had an explicit "start_of_offload" pass,
we can embed this processing there, and enable it with a bool-valued target hook
by the accelerator backend. (possibly only when ACCEL_COMPILER is defined)

In short of that, I think placing it here before RTL expansion seems the
most well defined, even if we have to handle some optimized obscurity.

and thus
is placed in TARGET_EXPAND_TO_RTL_HOOK, as the core issue is we inherently
need something different generated between the host-fallback vs for the GPU.

(Likewise, different per each offload target.)

The new nvptx_expand_to_rtl_hook modifies the function decl type and
arguments, and scans the gimple body to remove occurrences of .omp_data_i.*
Detection of OpenACC child functions is done through "omp target entrypoint"
and "oacc function" attributes. Because OpenMP target child functions
have a more elaborate wrapper generated for them, this pass only supports
OpenACC right now.

At the Cauldron, the question indeed has been raised (Jakub, Tom) why not
enabled for OpenMP, too.  My answer was that this surely can be done, but
the change as presented here already is an improvement over the current
status ("stands on its own", as Jeff Law would call it), so I'm fine with
you handling OpenACC first, and then OpenMP can follow later (at some as
of yet indeterminite point in time, even).


The OpenMP way of wrapping the user defined GPU kernel with lots of 
initialization
code does make this much more tedious I think.

The question should actually be, can OpenMP simply do this kind of 
initialization
by the host libgomp runtime like OpenACC does, and make the nvptx kernel
proper more similar between the two?

libgomp has tested with this patch x86_64-linux (nvptx-none accelerator)
without regressions

Can you present performance numbers, too?


Haven't got to that yet.

(I'm currently undergoing more gcc tests as well).

As these changes, being confined to nvptx code only, can't possibly have
any effect on other target testing, I assume that's nvptx target testing
you're talking about?  (..., where also I'm not expecting any
disturbance.)


Yeah, I was talking about nvptx-none compiler testing. Haven't found any 
changes.

--- gcc/config/nvptx/nvptx.c    (revision 275493)

+++ gcc/config/nvptx/nvptx.c    (working copy)
+static void
+nvptx_expand_to_rtl_hook (void)
+{
+  /* For utilizing CUDA .param kernel arguments, we detect and modify
+     the gimple of offloaded child functions, here before RTL expansion,
+     starting with standard OMP form:
+      foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { ... }
+
+     and transform it into a style where the OMP data record fields are
+     "exploded" into individual scalar arguments:
+      foo._omp_fn.0 (int * a, int * b, int * c) { ... }
+
+     Note that there are implicit assumptions of how OMP lowering (and/or other
+     intervening passes) behaves contained in this transformation code;
+     if those passes change in their output, this code may possibly need
+     updating.  */
+
+  if (lookup_attribute ("omp target entrypoint",
+                       DECL_ATTRIBUTES (current_function_decl))
+      /* The rather indirect manner in which OpenMP target functions are
+        launched makes this transformation only valid for OpenACC currently.
+        TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc.
+        needs changes for this to work with OpenMP.  */
+      && lookup_attribute ("oacc function",
+                          DECL_ATTRIBUTES (current_function_decl))
+      && VOID_TYPE_P (TREE_TYPE (DECL_RESULT (current_function_decl))))

Why the 'void' return conditional?  (Or, should that rather be an
'gcc_checking_assert' at the top of the following block?)


That the shape of child functions omp-low generates. Maybe that should be an
assertion, though here I'm just doing sanity checking and ignoring otherwise.

Come to think of it, maybe I should try using the assertion to check if
I'm unintentionally ignoring transforming some cases...

+    {
+      tree omp_data_arg = DECL_ARGUMENTS (current_function_decl);
+      tree argtype = TREE_TYPE (omp_data_arg);
+
+      /* Ensure this function is of the form of a single reference argument
+        to the OMP data record, or a single void* argument (when no values
+        passed)  */
+      if (! (DECL_CHAIN (omp_data_arg) == NULL_TREE
+            && ((TREE_CODE (argtype) == REFERENCE_TYPE
+                 && TREE_CODE (TREE_TYPE (argtype)) == RECORD_TYPE)
+                || (TREE_CODE (argtype) == POINTER_TYPE
+                    && TREE_TYPE (argtype) == void_type_node))))
+       return;

Again, is that something we should 'gcc_checking_assert', so that we'll
notice when something changes/breaks?


As above.

Given your note above, "there are implicit assumptions [on] OMP
lowering", I'd assume that this code here does quite some
'gcc_checking_assert'ions to make sure that we're within the expected
bounds.

+      /* Remove local decls which correspond to *.omp_data_i->FIELD entries, by
+        scanning and skipping those entries, creating a new local_decls list.
+        We assume a very specific MEM_REF tree expression shape.  */
+      tree decl;
+      unsigned int i;
+      vec<tree, va_gc> *new_local_decls = NULL;
+      FOR_EACH_VEC_SAFE_ELT (cfun->local_decls, i, decl)
+       {
+         if (DECL_HAS_VALUE_EXPR_P (decl))
+           {
+             tree t = DECL_VALUE_EXPR (decl);
+             if (TREE_CODE (t) == MEM_REF
+                 && TREE_CODE (TREE_OPERAND (t, 0)) == COMPONENT_REF
+                 && TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) == 
MEM_REF
+                 && (TREE_OPERAND (TREE_OPERAND (TREE_OPERAND (t, 0), 0), 0)
+                     == omp_data_arg))
+               continue;
+           }
+         vec_safe_push (new_local_decls, decl);
+       }
+      vec_free (cfun->local_decls);
+      cfun->local_decls = new_local_decls;

Is it worth doing that manually, or can/should some dead code elimination
pass deal with that?


I think I ICE'd somewhere before adding this removal.

+      /* Scan function body for assignments from .omp_data_i->FIELD, and using
+        the above created fld_to_args hash map, convert them to reads of
+        function arguments.  */
+           else if (TREE_CODE (val) == MEM_REF
+                    && TREE_CODE (TREE_OPERAND (val, 0)) == SSA_NAME
+                    && SSA_NAME_VAR (TREE_OPERAND (val, 0)) == omp_data_arg)
+             {
+               /* This case may happen in the final tree level optimization
+                  output, due to SLP:
+                  vect.XX = MEM <vector(1) unsigned long> [(void 
*).omp_data_i_5(D) + 8B]
+
+                  Therefore here we need a more elaborate search of the field
+                  list to reverse map to which field the offset is referring
+                  to.  */

Would this be simpler if the conversion would be done earlier?  (And I
mentioned above.)


Yes, it would be much less laborious :P

+           /* If we found the corresponding OMP data record field, replace the
+              RHS with the new created PARM_DECL.  */
+           if (new_val != NULL_TREE)
+             {
+               if (dump_file)
+                 {
+                   fprintf (dump_file, "For gimple stmt: ");
+                   print_gimple_stmt (dump_file, stmt, 0);
+                   fprintf (dump_file, "\tReplacing OMP recv ref %s with %s\n",
+                            print_generic_expr_to_str (val),
+                            print_generic_expr_to_str (new_val));
+                 }
+               /* Write in looked up ARG as new RHS value.  */
+               *val_ptr = new_val;
+             }

If 'new_val == NULL_TREE' that simply means that we've been looking at
something that doesn't need to be handled here, right?


Technically, it only means we haven't found something to replace the 
.omp_data_i->FIELD ref.

+      /* Delete SSA_NAMEs of .omp_data_i by setting them to NULL_TREE.  */
+      tree name;
+      FOR_EACH_SSA_NAME (i, name, cfun)
+       if (SSA_NAME_VAR (name) == omp_data_arg)
+         (*SSANAMES (cfun))[SSA_NAME_VERSION (name)] = NULL_TREE;

Again, manual cleanup vs. automated?


The "automated" one that marked it unused didn't really remove it for some 
reason,
I forgot why, probably because it was still considered "used" in some way.
Only this manual manipulation worked.

--- libgomp/plugin/plugin-nvptx.c       (revision 275493)
+++ libgomp/plugin/plugin-nvptx.c       (working copy)
@@ -1438,78 +1374,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void
                                 unsigned *dims, void *targ_mem_desc,
                                 struct goacc_asyncqueue *aq)
  {
[...]
-  if (mapnum > 0)
-    GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, aq->cuda_stream);
  }

Wasn't that the only user of 'GOMP_OFFLOAD_openacc_async_queue_callback'?


No, gomp_map_vars uses it as a plugin hook from libgomp proper, though
this was the only instance from inside the nvptx plugin.

(another use from oacc-async.c:goacc_async_free() appears to be orphaned now,
though I think we should keep that routine for a while, as it appears to
possibly be of use)

I'll try changing some of those cases we identified to be 'assertifiable' and
see what happens.

Thanks,
Chung-Lin

Re: [PATCH, nvptx] Expand OpenACC child function arguments to use CUDA params space

Reply via email to