On 12/18/2017 02:58 PM, Cesar Philippidis wrote:
> Jakub,
> 
> I'd like your thoughts on the following problem.
> 
> One of the offloading bottlenecks with GPU acceleration in OpenACC is
> the nontrivial offloaded function invocation overhead. At present, GCC
> generates code to pass a struct containing one field for each of the
> data mappings used in the OMP child function. I'm guessing a struct is
> used because pthread_create only accepts a single argument for new
> threads. What I'd like to do is to create the child function with one
> argument per data mapping. This has a number of advantages:
> 
>   1. No device memory needs to be managed for the child function data
>      mapping struct.
> 
>   2. On PTX targets, the .param address space is cached. Using
>      individual parameters for function arguments will allow the nvptx
>      back end to generate a more relaxed "execution model" because the
>      thread initialization code will be accessing cache memory instead
>      of global memory.
> 
>   3. It was my hope that this would set a path to eliminate the
>      GOMP_MAP_FIRSTPRIVATE_INT optimization, by replacing those mappings
>      with the actual value directly.
> 
> 1) is huge for programs, such as cloverleaf, which launch a lot of small
> parallel regions a lot of times.
> 
> For the execution model in 2), OpenACC begins each parallel region in a
> gang-redundant, worker-single and vector-single state. To transition
> from a single-threaded (or single vector lane) state to a multi-threaded
> partitioned state, GCC needs to emit code to propagate live variables,
> both on the stack and in registers, to the spawned threads. A lot of loops,
> including DGEMV from BLAS, can be executed in a fully-redundant state.
> Executing code redundantly has the advantage of not requiring any state
> transition code. The problem here is that a) the struct is in global
> memory, and b) not all of the GPU threads are executing the same
> instruction at the same time. Consequently, initializing each thread in
> a fully redundant manner actually hurts performance. When I rewrote the
> same test case passing the data mappings via individual parameters, that
> optimization improved performance compared to GCC trunk's baseline.
> 
> Lastly, 3) is more of a simplification than anything else. I'm not too
> concerned about this because those variables only get initialized once.
> So long as they don't require a separate COPYIN data mapping, the
> performance hit should be negligible.
> 
> In this first attempt at using parameters I taught lower_omp_target how
> to create child functions for OpenACC parallel regions with individual
> parameters for the data mappings instead of using a large struct. This
> works for the most part, but I realized too late that pthread_create
> only passes one argument to each thread it creates. It should be noted
> that I left the kernels implementation as-is, using the global struct
> argument because kernels in GCC is largely ineffective and it usually
> falls back to executing code on the host CPU. Eventually, we want to
> redo kernels, but not until we get the parallel code running efficiently.
> 
> For fallback host targets, libgomp is using libffi to pass arguments to
> the offloaded functions. This works OK at the moment because the host
> code is always single-threaded. Ideally, that would change in the
> future, but I'm not aware of any immediate plans to do so.
> 
> Question: is this approach acceptable for Stage 1 in May, or should I
> make the offloaded function parameter expansion target-specific? I can
> think of a couple of ways to make this target-specific:
> 
>   a. Create two child functions during lowering, one with individual
>      parameters for the data mappings, and another which takes in a
>      single struct. The latter then calls the former immediately on
>      entry.
> 
>   b. Teach oaccdevlow to expand the incoming struct into individual
>      parameters.
> 
> I'm concerned that b) is going to be a large pass. The SRA pass is
> somewhat large at 5k. While this should be simpler, I'm not sure by how
> much (probably a lot because it won't need to perform as much analysis).
> 
> While this patch is functional, it's not complete. I still need to tweak
> a couple of things in the runtime. But I don't want to spend too much
> time on it if we decide to go with a different approach.
> 
> Any thoughts are welcome.
> 
> By the way, next we'll be working on increasing vector_length on nvptx
> targets. In conjunction with that, we'll be simplifying the OpenACC
> execution model in the nvptx BE, along with adding a new reduction
> finalizer.

After thinking about this some more, I decided that it would be better
to expand the offloaded function arguments into individual parameters
during omp lowering, rather than writing a separate pass later on. I
don't see too many disadvantages to using libffi after a pthread is
spawned by the host. If anything, the pthread's use of libffi is
equivalent to performing SRA on the accelerator anyway.
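
To make the two calling conventions concrete, here is a minimal sketch
in plain C. It is an illustration only, not what GCC emits: the names
omp_data_sketch, child_fn_params, child_fn_struct and run_sketch are
hypothetical, and the region body (y[i] += a * x[i]) is an assumed
example. It shows a child function taking one pointer per data mapping,
plus a single-struct entry point that forwards to it immediately on
entry, as in option a) of the quoted message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-region data mapping struct:
   one field for each mapped variable.  */
struct omp_data_sketch
{
  float *x;
  float *y;
  size_t *n;
  float *a;
};

/* Per-parameter form: one pointer argument per data mapping.  */
static void
child_fn_params (void *x_, void *y_, void *n_, void *a_)
{
  float *x = x_;
  float *y = y_;
  size_t n = *(size_t *) n_;
  float a = *(float *) a_;

  for (size_t i = 0; i < n; i++)
    y[i] += a * x[i];
}

/* Struct form: a single argument, as pthread_create would require.
   It unpacks the fields and calls the per-parameter form on entry.  */
static void
child_fn_struct (void *omp_data_i)
{
  struct omp_data_sketch *d = omp_data_i;

  assert (d != NULL);
  child_fn_params (d->x, d->y, d->n, d->a);
}

/* Hypothetical driver: run the region through either entry point and
   return the sum of y so the two paths can be compared.  */
static float
run_sketch (int use_params)
{
  float x[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
  float y[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
  size_t n = 4;
  float a = 2.0f;

  if (use_params)
    child_fn_params (x, y, &n, &a);
  else
    {
      struct omp_data_sketch d = { x, y, &n, &a };
      child_fn_struct (&d);
    }

  return y[0] + y[1] + y[2] + y[3];
}
```

Both entry points compute the same result. The difference is only in
where the mapping addresses live when the child starts: in the argument
list itself, or behind one extra pointer dereference into a struct.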

I've committed this patch to openacc-gcc-7-branch.

Note that I had to xfail
libgomp.oacc-c-c++-common/combined-directives-1.c because I disabled
structure alias analysis (tree-ssa-structalias.c) on parallel regions.
Unfortunately, that makes kernels slightly less effective. But more often
than not, kernels regions fall back to host execution anyway.

Cesar

2017-12-21  Cesar Philippidis  <ce...@codesourcery.com>

	* Makefile.def: Make libgomp depend on libffi.
	* configure.ac: Likewise.
	* Makefile.in: Regenerate.
	* configure: Regenerate.

	gcc/fortran/
	* types.def (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR):
	Define.

	gcc/
	* builtin-types.def (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR):
	Define.
	* config/nvptx/nvptx.c (nvptx_expand_cmp_swap): Handle PARM_DECLs.
	* omp-builtins.def (BUILT_IN_GOACC_PARALLEL): Call
	GOACC_parallel_keyed_v2.
	* omp-expand.c (expand_omp_target): Update call to
	BUILT_IN_GOACC_PARALLEL.
	* omp-low.c (struct omp_context): Add parm_map member.
	(lookup_parm): New function.
	(build_receiver_ref): Lookup parm_map decls.
	(install_parm_decl): New function.
	(install_var_field): Install parm_map decl for OpenACC parallel region
	data clauses.
	(delete_omp_context): Clean parm_map.
	(scan_sharing_clauses): Install subarray variable mapping into parm_map.
	(create_omp_child_function): Defer creation of child function for
	OpenACC parallel regions.
	(scan_omp_target): Likewise.
	(append_decl_arg): New function.
	(lower_omp_target): Create an offloaded child function using one
	parameter per data mapping for OpenACC parallel regions.
	* tree-ssa-structalias.c (find_func_aliases_for_builtin_call):
	Ignore OpenACC parallel regions.
	(find_func_clobbers): Likewise.
	(ipa_pta_execute): Likewise.

	libgomp/
	* Makefile.am: Add libffi build dependency.
	* configure.ac: Likewise.
	* Makefile.in: Regenerate.
	* config.h.in: Regenerate.
	* configure: Regenerate.
	* libgomp-plugin.h: Define GOMP_OFFLOAD_openacc_exec_params and
	GOMP_OFFLOAD_openacc_async_exec_params.
	* libgomp.h (acc_dispatch_t): Use them here.
	* libgomp.map (GOACC_parallel_keyed_v2): Declare.
	* libgomp_g.h (GOACC_parallel_keyed_v2): Likewise.
	* oacc-host.c (host_openacc_exec_params): New function.
	(host_openacc_async_exec_params): Likewise.
	* oacc-parallel.c (goacc_call_host_fn): Likewise.
	(GOACC_parallel_keyed_internal): Likewise.
	(GOACC_parallel_keyed): Wrapper for GOACC_parallel_keyed_internal.
	(GOACC_parallel_keyed_v2): Likewise.
	* plugin/plugin-nvptx.c (nvptx_exec): Replace CUDeviceptr dp parameter
	with void **kargs.
	(openacc_exec_internal): New function.
	(GOMP_OFFLOAD_openacc_exec_params): New function.
	(GOMP_OFFLOAD_openacc_exec): Update to call openacc_exec_internal.
	(openacc_async_exec_internal): New function.
	(GOMP_OFFLOAD_openacc_async_exec_params): New function.
	(GOMP_OFFLOAD_openacc_async_exec): Update call to
	openacc_async_exec_internal.
	* target.c (gomp_load_plugin_for_device): Handle
	openacc_exec_params and openacc_async_exec_params.
	* testsuite/Makefile.in: Regenerate.
	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c:
	Xfail on offloaded targets.


diff --git a/Makefile.def b/Makefile.def
index abfa9efe959..5e94062fa75 100644
--- a/Makefile.def
+++ b/Makefile.def
@@ -550,6 +550,7 @@ dependencies = { module=configure-target-libgo; on=all-target-libstdc++-v3; };
 dependencies = { module=all-target-libgo; on=all-target-libbacktrace; };
 dependencies = { module=all-target-libgo; on=all-target-libffi; };
 dependencies = { module=all-target-libgo; on=all-target-libatomic; };
+dependencies = { module=configure-target-libgomp; on=configure-target-libffi; };
 dependencies = { module=configure-target-libstdc++-v3; on=configure-target-libgomp; };
 dependencies = { module=configure-target-liboffloadmic; on=configure-target-libgomp; };
 dependencies = { module=configure-target-libsanitizer; on=all-target-libstdc++-v3; };
@@ -564,6 +565,7 @@ dependencies = { module=install-target-libgo; on=install-target-libatomic; };
 dependencies = { module=install-target-libgfortran; on=install-target-libquadmath; };
 dependencies = { module=install-target-libgfortran; on=install-target-libgcc; };
 dependencies = { module=install-target-libsanitizer; on=install-target-libstdc++-v3; };
+dependencies = { module=install-target-libgomp; on=install-target-libffi; };
 dependencies = { module=install-target-libsanitizer; on=install-target-libgcc; };
 dependencies = { module=install-target-libvtv; on=install-target-libstdc++-v3; };
 dependencies = { module=install-target-libvtv; on=install-target-libgcc; };
diff --git a/Makefile.in b/Makefile.in
index b824e0a0ca1..9b4497e3943 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -55803,6 +55803,7 @@ configure-target-libgo: maybe-all-target-libstdc++-v3
 all-target-libgo: maybe-all-target-libbacktrace
 all-target-libgo: maybe-all-target-libffi
 all-target-libgo: maybe-all-target-libatomic
+configure-target-libgomp: maybe-configure-target-libffi
 configure-target-libstdc++-v3: maybe-configure-target-libgomp
 
 configure-stage1-target-libstdc++-v3: maybe-configure-stage1-target-libgomp
@@ -55849,6 +55850,7 @@ install-target-libgo: maybe-install-target-libatomic
 install-target-libgfortran: maybe-install-target-libquadmath
 install-target-libgfortran: maybe-install-target-libgcc
 install-target-libsanitizer: maybe-install-target-libstdc++-v3
+install-target-libgomp: maybe-install-target-libffi
 install-target-libsanitizer: maybe-install-target-libgcc
 install-target-libvtv: maybe-install-target-libstdc++-v3
 install-target-libvtv: maybe-install-target-libgcc
diff --git a/configure b/configure
index 32a38633ad8..ed47944d8f9 100755
--- a/configure
+++ b/configure
@@ -3472,11 +3472,19 @@ case "${target}" in
   ft32-*-*)
     noconfigdirs="$noconfigdirs target-libffi"
     ;;
+  nvptx-*-*)
+    noconfigdirs="$noconfigdirs target-libffi"
+    ;;
   *-*-lynxos*)
     noconfigdirs="$noconfigdirs target-libffi"
     ;;
 esac
 
+libgomp_deps="target-libffi"
+if echo " ${noconfigdirs} " | grep " target-libffi " > /dev/null 2>&1 ; then
+   libgomp_deps=""
+fi
+
 # Disable the go frontend on systems where it is known to not work. Please keep
 # this in sync with contrib/config-list.mk.
 case "${target}" in
@@ -6460,6 +6468,15 @@ esac
 # $build_configdirs and $target_configdirs.
 # If we have the source for $noconfigdirs entries, add them to $notsupp.
 
+# libgomp depends on libffi.  Remove it from noconfigdirs if necessary.
+if ! (echo " $noconfigdirs " | grep " target-libgomp " >/dev/null 2>&1); then
+  if echo " $noconfigdirs " | grep " target-libffi " >/dev/null 2>&1; then
+    if test "x${libgomp_deps}" != x; then
+      noconfigdirs=`echo " $noconfigdirs " | sed -e "s/ target-libffi / /"`
+    fi
+  fi
+fi
+
 notsupp=""
 for dir in . $skipdirs $noconfigdirs ; do
   dirname=`echo $dir | sed -e s/target-//g -e s/build-//g`
diff --git a/configure.ac b/configure.ac
index 12377499295..a3b9e116a05 100644
--- a/configure.ac
+++ b/configure.ac
@@ -800,11 +800,19 @@ case "${target}" in
   ft32-*-*)
     noconfigdirs="$noconfigdirs target-libffi"
     ;;
+  nvptx-*-*)
+    noconfigdirs="$noconfigdirs target-libffi"
+    ;;
   *-*-lynxos*)
     noconfigdirs="$noconfigdirs target-libffi"
     ;;
 esac
 
+libgomp_deps="target-libffi"
+if echo " ${noconfigdirs} " | grep " target-libffi " > /dev/null 2>&1 ; then
+   libgomp_deps=""
+fi
+
 # Disable the go frontend on systems where it is known to not work. Please keep
 # this in sync with contrib/config-list.mk.
 case "${target}" in
@@ -2127,6 +2135,15 @@ esac
 # $build_configdirs and $target_configdirs.
 # If we have the source for $noconfigdirs entries, add them to $notsupp.
 
+# libgomp depends on libffi.  Remove it from noconfigdirs if necessary.
+if ! (echo " $noconfigdirs " | grep " target-libgomp " >/dev/null 2>&1); then
+  if echo " $noconfigdirs " | grep " target-libffi " >/dev/null 2>&1; then
+    if test "x${libgomp_deps}" != x; then
+      noconfigdirs=`echo " $noconfigdirs " | sed -e "s/ target-libffi / /"`
+    fi
+  fi
+fi
+
 notsupp=""
 for dir in . $skipdirs $noconfigdirs ; do
   dirname=`echo $dir | sed -e s/target-//g -e s/build-//g`
diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def
index ac9894467ec..7f647c65162 100644
--- a/gcc/builtin-types.def
+++ b/gcc/builtin-types.def
@@ -763,6 +763,10 @@ DEF_FUNCTION_TYPE_VAR_6 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
 			 BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
 			 BT_PTR, BT_PTR, BT_PTR)
 
+DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
+			 BT_VOID, BT_INT, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
+			 BT_PTR, BT_PTR, BT_PTR)
+
 DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR,
 			 BT_VOID, BT_INT, BT_SIZE, BT_PTR, BT_PTR,
 			 BT_PTR, BT_INT, BT_INT)
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index a7b4c09bf6c..55c7e3cbf90 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4737,6 +4737,10 @@ nvptx_expand_cmp_swap (tree exp, rtx target,
 			 NULL_RTX, mode, EXPAND_NORMAL);
   rtx pat;
 
+  /* 'mem' might be a PARM_DECL.  If so, convert it to a register.  */
+  if (!REG_P (mem))
+    mem = copy_to_mode_reg (GET_MODE (mem), mem);
+
   mem = gen_rtx_MEM (mode, mem);
   if (!REG_P (cmp))
     cmp = copy_to_mode_reg (mode, cmp);
diff --git a/gcc/fortran/types.def b/gcc/fortran/types.def
index 1f8a5a1277c..3c3ad69d848 100644
--- a/gcc/fortran/types.def
+++ b/gcc/fortran/types.def
@@ -252,3 +252,7 @@ DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR,
 DEF_FUNCTION_TYPE_VAR_6 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
 			  BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
 			  BT_PTR, BT_PTR, BT_PTR)
+
+DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
+			  BT_VOID, BT_INT, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
+			  BT_PTR, BT_PTR, BT_PTR)
diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def
index 69b73f4b8c4..a9ec667aa54 100644
--- a/gcc/omp-builtins.def
+++ b/gcc/omp-builtins.def
@@ -38,8 +38,8 @@ DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DATA_END, "GOACC_data_end",
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_ENTER_EXIT_DATA, "GOACC_enter_exit_data",
 		   BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR,
 		   ATTR_NOTHROW_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_PARALLEL, "GOACC_parallel_keyed",
-		   BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_PARALLEL, "GOACC_parallel_keyed_v2",
+		   BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
 		   ATTR_NOTHROW_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_UPDATE, "GOACC_update",
 		   BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR,
diff --git a/gcc/omp-expand.c b/gcc/omp-expand.c
index bf1f127d8d6..f674c74ec82 100644
--- a/gcc/omp-expand.c
+++ b/gcc/omp-expand.c
@@ -7097,19 +7097,21 @@ expand_omp_target (struct omp_region *region)
   gomp_target *entry_stmt;
   gimple *stmt;
   edge e;
-  bool offloaded, data_region;
+  bool offloaded, data_region, oacc_parallel;
 
   entry_stmt = as_a <gomp_target *> (last_stmt (region->entry));
   new_bb = region->entry;
+  oacc_parallel = false;
 
   offloaded = is_gimple_omp_offloaded (entry_stmt);
   switch (gimple_omp_target_kind (entry_stmt))
     {
+    case GF_OMP_TARGET_KIND_OACC_PARALLEL:
+      oacc_parallel = true;
     case GF_OMP_TARGET_KIND_REGION:
     case GF_OMP_TARGET_KIND_UPDATE:
     case GF_OMP_TARGET_KIND_ENTER_DATA:
     case GF_OMP_TARGET_KIND_EXIT_DATA:
-    case GF_OMP_TARGET_KIND_OACC_PARALLEL:
     case GF_OMP_TARGET_KIND_OACC_KERNELS:
     case GF_OMP_TARGET_KIND_OACC_UPDATE:
     case GF_OMP_TARGET_KIND_OACC_ENTER_EXIT_DATA:
@@ -7171,7 +7173,7 @@ expand_omp_target (struct omp_region *region)
 	 .OMP_DATA_I may have been converted into a different local
 	 variable.  In which case, we need to keep the assignment.  */
       tree data_arg = gimple_omp_target_data_arg (entry_stmt);
-      if (data_arg)
+      if (data_arg && !oacc_parallel)
 	{
 	  basic_block entry_succ_bb = single_succ (entry_bb);
 	  gimple_stmt_iterator gsi;
@@ -7489,6 +7491,11 @@ expand_omp_target (struct omp_region *region)
   /* The maximum number used by any start_ix, without varargs.  */
   auto_vec<tree, 11> args;
   args.quick_push (device);
+  if (start_ix == BUILT_IN_GOACC_PARALLEL)
+    {
+      tree use_params = oacc_parallel ? integer_one_node : integer_zero_node;
+      args.quick_push (use_params);
+    }
   if (offloaded)
     args.quick_push (build_fold_addr_expr (child_fn));
   args.quick_push (t1);
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index e790f0f1bb2..a2869e49ebd 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -89,6 +89,7 @@ struct omp_context
   /* Map variables to fields in a structure that allows communication
      between sending and receiving threads.  */
   splay_tree field_map;
+  splay_tree parm_map;
   tree record_type;
   tree sender_decl;
   tree receiver_decl;
@@ -321,6 +322,14 @@ maybe_lookup_decl (const_tree var, omp_context *ctx)
 }
 
 static inline tree
+lookup_parm (const_tree var, omp_context *ctx)
+{
+  splay_tree_node n;
+  n = splay_tree_lookup (ctx->parm_map, (splay_tree_key) var);
+  return (tree) n->value;
+}
+
+static inline tree
 lookup_field (tree var, omp_context *ctx)
 {
   splay_tree_node n;
@@ -501,15 +510,21 @@ build_receiver_ref (tree var, bool by_ref, omp_context *ctx)
 {
   tree x, field = lookup_field (var, ctx);
 
-  /* If the receiver record type was remapped in the child function,
-     remap the field into the new record type.  */
-  x = maybe_lookup_field (field, ctx);
-  if (x != NULL)
-    field = x;
+  if (is_oacc_parallel (ctx))
+    x = lookup_parm (var, ctx);
+  else
+    {
+      /* If the receiver record type was remapped in the child function,
+	 remap the field into the new record type.  */
+      x = maybe_lookup_field (field, ctx);
+      if (x != NULL)
+	field = x;
+
+      x = build_simple_mem_ref (ctx->receiver_decl);
+      TREE_THIS_NOTRAP (x) = 1;
+      x = omp_build_component_ref (x, field);
+    }
 
-  x = build_simple_mem_ref (ctx->receiver_decl);
-  TREE_THIS_NOTRAP (x) = 1;
-  x = omp_build_component_ref (x, field);
   if (by_ref)
     {
       x = build_simple_mem_ref (x);
@@ -644,6 +659,32 @@ build_sender_ref (tree var, omp_context *ctx)
   return build_sender_ref ((splay_tree_key) var, ctx);
 }
 
+static void
+install_parm_decl (tree var, tree type, omp_context *ctx)
+{
+  if (!is_oacc_parallel (ctx))
+    return;
+
+  splay_tree_key key = (splay_tree_key) var;
+  tree decl_name = NULL_TREE, t;
+  location_t loc = UNKNOWN_LOCATION;
+
+  if (DECL_P (var))
+    {
+      decl_name = get_identifier (get_name (var));
+      loc = DECL_SOURCE_LOCATION (var);
+    }
+  t = build_decl (loc, PARM_DECL, decl_name, type);
+  DECL_ARTIFICIAL (t) = 1;
+  DECL_NAMELESS (t) = 1;
+  DECL_ARG_TYPE (t) = type;
+  DECL_CONTEXT (t) = current_function_decl;
+  TREE_USED (t) = 1;
+  TREE_READONLY (t) = 1;
+
+  splay_tree_insert (ctx->parm_map, key, (splay_tree_value) t);
+}
+
 /* Add a new field for VAR inside the structure CTX->SENDER_DECL.  If
    BASE_POINTERS_RESTRICT, declare the field with restrict.  */
 
@@ -764,7 +805,10 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx,
     }
 
   if (mask & 1)
-    splay_tree_insert (ctx->field_map, key, (splay_tree_value) field);
+    {
+      splay_tree_insert (ctx->field_map, key, (splay_tree_value) field);
+      install_parm_decl (var, type, ctx);
+    }
   if ((mask & 2) && ctx->sfield_map)
     splay_tree_insert (ctx->sfield_map, key, (splay_tree_value) sfield);
 }
@@ -1068,6 +1112,8 @@ delete_omp_context (splay_tree_value value)
     splay_tree_delete (ctx->field_map);
   if (ctx->sfield_map)
     splay_tree_delete (ctx->sfield_map);
+  if (ctx->parm_map)
+    splay_tree_delete (ctx->parm_map);
 
   /* We hijacked DECL_ABSTRACT_ORIGIN earlier.  We need to clear it before
      it produces corrupt debug information.  */
@@ -1506,6 +1552,7 @@ scan_sharing_clauses (tree clauses, omp_context *ctx,
 		  insert_field_into_struct (ctx->record_type, field);
 		  splay_tree_insert (ctx->field_map, (splay_tree_key) decl,
 				     (splay_tree_value) field);
+		  install_parm_decl (decl, ptr_type_node, ctx);
 		}
 	    }
 	  break;
@@ -1800,10 +1847,13 @@ omp_maybe_offloaded_ctx (omp_context *ctx)
 }
 
 /* Build a decl for the omp child function.  It'll not contain a body
-   yet, just the bare decl.  */
+   yet, just the bare decl.  Unlike omp child functions, acc child
+   functions for parallel regions have one argument per data
+   mapping.  */
 
 static void
-create_omp_child_function (omp_context *ctx, bool task_copy)
+create_omp_child_function (omp_context *ctx, bool task_copy,
+			   unsigned int map_cnt = 0)
 {
   tree decl, type, name, t;
 
@@ -1825,6 +1875,13 @@ create_omp_child_function (omp_context *ctx, bool task_copy)
       type = build_function_type_list (void_type_node, ptr_type_node,
 				       cilk_var_type, cilk_var_type, NULL_TREE);
     }
+  else if (is_oacc_parallel (ctx))
+    {
+      tree *arg_types = (tree *) alloca (sizeof (tree) * map_cnt);
+      for (unsigned int i = 0; i < map_cnt; i++)
+	arg_types[i] = ptr_type_node;
+      type = build_function_type_array (void_type_node, map_cnt, arg_types);
+    }
   else
     type = build_function_type_list (void_type_node, ptr_type_node, NULL_TREE);
 
@@ -1899,35 +1956,37 @@ create_omp_child_function (omp_context *ctx, bool task_copy)
       DECL_ARGUMENTS (decl) = t;
     }
 
-  tree data_name = get_identifier (".omp_data_i");
-  t = build_decl (DECL_SOURCE_LOCATION (decl), PARM_DECL, data_name,
-		  ptr_type_node);
-  DECL_ARTIFICIAL (t) = 1;
-  DECL_NAMELESS (t) = 1;
-  DECL_ARG_TYPE (t) = ptr_type_node;
-  DECL_CONTEXT (t) = current_function_decl;
-  TREE_USED (t) = 1;
-  TREE_READONLY (t) = 1;
-  if (cilk_for_count)
-    DECL_CHAIN (t) = DECL_ARGUMENTS (decl);
-  DECL_ARGUMENTS (decl) = t;
-  if (!task_copy)
-    ctx->receiver_decl = t;
-  else
+  if (!is_oacc_parallel (ctx))
     {
-      t = build_decl (DECL_SOURCE_LOCATION (decl),
-		      PARM_DECL, get_identifier (".omp_data_o"),
+      tree data_name = get_identifier (".omp_data_i");
+      t = build_decl (DECL_SOURCE_LOCATION (decl), PARM_DECL, data_name,
 		      ptr_type_node);
       DECL_ARTIFICIAL (t) = 1;
       DECL_NAMELESS (t) = 1;
       DECL_ARG_TYPE (t) = ptr_type_node;
       DECL_CONTEXT (t) = current_function_decl;
       TREE_USED (t) = 1;
-      TREE_ADDRESSABLE (t) = 1;
-      DECL_CHAIN (t) = DECL_ARGUMENTS (decl);
+      TREE_READONLY (t) = 1;
+      if (cilk_for_count)
+	DECL_CHAIN (t) = DECL_ARGUMENTS (decl);
       DECL_ARGUMENTS (decl) = t;
+      if (!task_copy)
+	ctx->receiver_decl = t;
+      else
+	{
+	  t = build_decl (DECL_SOURCE_LOCATION (decl),
+			  PARM_DECL, get_identifier (".omp_data_o"),
+			  ptr_type_node);
+	  DECL_ARTIFICIAL (t) = 1;
+	  DECL_NAMELESS (t) = 1;
+	  DECL_ARG_TYPE (t) = ptr_type_node;
+	  DECL_CONTEXT (t) = current_function_decl;
+	  TREE_USED (t) = 1;
+	  TREE_ADDRESSABLE (t) = 1;
+	  DECL_CHAIN (t) = DECL_ARGUMENTS (decl);
+	  DECL_ARGUMENTS (decl) = t;
+	}
     }
-
   /* Allocate memory for the function structure.  The call to
      allocate_struct_function clobbers CFUN, so we need to restore
      it afterward.  */
@@ -2608,6 +2667,7 @@ scan_omp_target (gomp_target *stmt, omp_context *outer_ctx)
 
   ctx = new_omp_context (stmt, outer_ctx);
   ctx->field_map = splay_tree_new (splay_tree_compare_pointers, 0, 0);
+  ctx->parm_map = splay_tree_new (splay_tree_compare_pointers, 0, 0);
   ctx->default_kind = OMP_CLAUSE_DEFAULT_SHARED;
   ctx->record_type = lang_hooks.types.make_type (RECORD_TYPE);
   name = create_tmp_var_name (".omp_data_t");
@@ -2621,8 +2681,11 @@ scan_omp_target (gomp_target *stmt, omp_context *outer_ctx)
   bool base_pointers_restrict = false;
   if (offloaded)
     {
-      create_omp_child_function (ctx, false);
-      gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
+      if (!is_oacc_parallel (ctx))
+	{
+	  create_omp_child_function (ctx, false);
+	  gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
+	}
 
       base_pointers_restrict = omp_target_base_pointers_restrict_p (clauses);
       if (base_pointers_restrict
@@ -7921,6 +7984,18 @@ convert_from_firstprivate_int (tree var, tree orig_type, bool is_ref,
   return var;
 }
 
+static tree
+append_decl_arg (tree var, tree decl_args, omp_context *ctx)
+{
+  if (!is_oacc_parallel (ctx))
+    return NULL_TREE;
+
+  tree temp = lookup_parm (var, ctx);
+  DECL_CHAIN (temp) = decl_args;
+
+  return temp;
+}
+
 /* Lower the GIMPLE_OMP_TARGET in the current statement
    in GSI_P.  CTX holds context information for the directive.  */
 
@@ -7934,7 +8009,7 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
   gimple_seq tgt_body, olist, ilist, fplist, new_body;
   location_t loc = gimple_location (stmt);
   bool offloaded, data_region;
-  unsigned int map_cnt = 0;
+  unsigned int map_cnt = 0, init_cnt = 0;
 
   offloaded = is_gimple_omp_offloaded (stmt);
   switch (gimple_omp_target_kind (stmt))
@@ -7980,11 +8055,83 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
     }
   else if (data_region)
     tgt_body = gimple_omp_body (stmt);
-  child_fn = ctx->cb.dst_fn;
 
   push_gimplify_context ();
   fplist = NULL;
 
+  /* Determine init_cnt to finish initializing ctx.  */
+
+  if (is_oacc_parallel (ctx))
+    {
+      for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c))
+	switch (OMP_CLAUSE_CODE (c))
+	  {
+	    tree var;
+
+	  default:
+	    break;
+	  case OMP_CLAUSE_MAP:
+	  case OMP_CLAUSE_TO:
+	  case OMP_CLAUSE_FROM:
+	  init_oacc_firstprivate:
+	    var = OMP_CLAUSE_DECL (c);
+	    if (!DECL_P (var))
+	      {
+		if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_MAP
+		    || (!OMP_CLAUSE_MAP_ZERO_BIAS_ARRAY_SECTION (c)
+			&& (OMP_CLAUSE_MAP_KIND (c)
+			    != GOMP_MAP_FIRSTPRIVATE_POINTER)))
+		  init_cnt++;
+		continue;
+	      }
+
+	    if (DECL_SIZE (var)
+		&& TREE_CODE (DECL_SIZE (var)) != INTEGER_CST)
+	      {
+		tree var2 = DECL_VALUE_EXPR (var);
+		gcc_assert (TREE_CODE (var2) == INDIRECT_REF);
+		var2 = TREE_OPERAND (var2, 0);
+		gcc_assert (DECL_P (var2));
+		var = var2;
+	      }
+
+	    if (offloaded
+		&& OMP_CLAUSE_CODE (c) == OMP_CLAUSE_MAP
+		&& (OMP_CLAUSE_MAP_KIND (c) == GOMP_MAP_FIRSTPRIVATE_POINTER
+		    || (OMP_CLAUSE_MAP_KIND (c)
+			== GOMP_MAP_FIRSTPRIVATE_REFERENCE)))
+	      {
+		continue;
+	      }
+
+	    if (!maybe_lookup_field (var, ctx))
+	      continue;
+
+	    init_cnt++;
+	    break;
+
+	  case OMP_CLAUSE_FIRSTPRIVATE:
+	    if (is_oacc_parallel (ctx))
+	      goto init_oacc_firstprivate;
+	    init_cnt++;
+	    break;
+
+	  case OMP_CLAUSE_USE_DEVICE_PTR:
+	  case OMP_CLAUSE_IS_DEVICE_PTR:
+	    init_cnt++;
+	    break;
+	  }
+
+      /* Initialize the offloaded child function.  */
+
+      create_omp_child_function (ctx, false, init_cnt);
+      gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
+    }
+
+  child_fn = ctx->cb.dst_fn;
+
+  /* Clause Pass 1: Scan and prepare sender decls VALUE_EXPRs for
+     usage on the child function.  */
   for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c))
     switch (OMP_CLAUSE_CODE (c))
       {
@@ -8247,6 +8394,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   if (offloaded)
     {
+      if (is_oacc_parallel (ctx))
+	gcc_assert (init_cnt == map_cnt);
       target_nesting_level++;
       lower_omp (&tgt_body, ctx);
       target_nesting_level--;
@@ -8293,6 +8442,7 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
       vec_alloc (vsize, map_cnt);
       vec_alloc (vkind, map_cnt);
       unsigned int map_idx = 0;
+      tree decl_args = NULL_TREE;
 
       for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c))
 	switch (OMP_CLAUSE_CODE (c))
@@ -8488,6 +8638,7 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 	    if (s == NULL_TREE)
 	      s = integer_one_node;
 	    s = fold_convert (size_type_node, s);
+	    decl_args = append_decl_arg (ovar, decl_args, ctx);
 	    purpose = size_int (map_idx++);
 	    CONSTRUCTOR_APPEND_ELT (vsize, purpose, s);
 	    if (TREE_CODE (s) != INTEGER_CST)
@@ -8628,6 +8779,7 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 	    else
 	      s = TYPE_SIZE_UNIT (TREE_TYPE (ovar));
 	    s = fold_convert (size_type_node, s);
+	    decl_args = append_decl_arg (ovar, decl_args, ctx);
 	    purpose = size_int (map_idx++);
 	    CONSTRUCTOR_APPEND_ELT (vsize, purpose, s);
 	    if (TREE_CODE (s) != INTEGER_CST)
@@ -8667,6 +8819,7 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 	      }
 	    gimplify_assign (x, var, &ilist);
 	    s = size_int (0);
+	    decl_args = append_decl_arg (ovar, decl_args, ctx);
 	    purpose = size_int (map_idx++);
 	    CONSTRUCTOR_APPEND_ELT (vsize, purpose, s);
 	    gcc_checking_assert (tkind
@@ -8679,6 +8832,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 	  }
 
       gcc_assert (map_idx == map_cnt);
+      if (is_oacc_parallel (ctx))
+	DECL_ARGUMENTS (child_fn) = nreverse (decl_args);
 
       DECL_INITIAL (TREE_VEC_ELT (t, 1))
 	= build_constructor (TREE_TYPE (TREE_VEC_ELT (t, 1)), vsize);
@@ -8717,9 +8872,12 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
     {
       t = build_fold_addr_expr_loc (loc, ctx->sender_decl);
       /* fixup_child_record_type might have changed receiver_decl's type.  */
-      t = fold_convert_loc (loc, TREE_TYPE (ctx->receiver_decl), t);
-      gimple_seq_add_stmt (&new_body,
-	  		   gimple_build_assign (ctx->receiver_decl, t));
+      if (!is_oacc_parallel (ctx))
+	{
+	  t = fold_convert_loc (loc, TREE_TYPE (ctx->receiver_decl), t);
+	  gimple_seq_add_stmt (&new_body,
+			       gimple_build_assign (ctx->receiver_decl, t));
+	}
     }
   gimple_seq_add_seq (&new_body, fplist);
 
diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c
index aab6821e792..c23ddeb9c86 100644
--- a/gcc/tree-ssa-structalias.c
+++ b/gcc/tree-ssa-structalias.c
@@ -4618,6 +4618,7 @@ find_func_aliases_for_builtin_call (struct function *fn, gcall *t)
       case BUILT_IN_GOMP_PARALLEL:
       case BUILT_IN_GOACC_PARALLEL:
 	{
+	  bool oacc_parallel = false;
 	  if (in_ipa_mode)
 	    {
 	      unsigned int fnpos, argpos;
@@ -4631,13 +4632,17 @@ find_func_aliases_for_builtin_call (struct function *fn, gcall *t)
 		case BUILT_IN_GOACC_PARALLEL:
 		  /* __builtin_GOACC_parallel (device, fn, mapnum, hostaddrs,
 					       sizes, kinds, ...).  */
-		  fnpos = 1;
-		  argpos = 3;
+		  fnpos = 2;
+		  argpos = 4;
+		  oacc_parallel = gimple_call_arg (t, 1) == integer_one_node;
 		  break;
 		default:
 		  gcc_unreachable ();
 		}
 
+	      if (oacc_parallel)
+		break;
+
 	      tree fnarg = gimple_call_arg (t, fnpos);
 	      gcc_assert (TREE_CODE (fnarg) == ADDR_EXPR);
 	      tree fndecl = TREE_OPERAND (fnarg, 0);
@@ -5195,6 +5200,7 @@ find_func_clobbers (struct function *fn, gimple *origt)
 	      unsigned int fnpos, argpos;
 	      unsigned int implicit_use_args[2];
 	      unsigned int num_implicit_use_args = 0;
+	      bool oacc_parallel = false;
 	      switch (DECL_FUNCTION_CODE (decl))
 		{
 		case BUILT_IN_GOMP_PARALLEL:
@@ -5205,15 +5211,19 @@ find_func_clobbers (struct function *fn, gimple *origt)
 		case BUILT_IN_GOACC_PARALLEL:
 		  /* __builtin_GOACC_parallel (device, fn, mapnum, hostaddrs,
 					       sizes, kinds, ...).  */
-		  fnpos = 1;
-		  argpos = 3;
-		  implicit_use_args[num_implicit_use_args++] = 4;
+		  fnpos = 2;
+		  argpos = 4;
 		  implicit_use_args[num_implicit_use_args++] = 5;
+		  implicit_use_args[num_implicit_use_args++] = 6;
+		  oacc_parallel = gimple_call_arg (t, 1) == integer_one_node;
 		  break;
 		default:
 		  gcc_unreachable ();
 		}
 
+	      if (oacc_parallel)
+		break;
+
 	      tree fnarg = gimple_call_arg (t, fnpos);
 	      gcc_assert (TREE_CODE (fnarg) == ADDR_EXPR);
 	      tree fndecl = TREE_OPERAND (fnarg, 0);
@@ -7968,7 +7978,7 @@ ipa_pta_execute (void)
 		if (gimple_call_builtin_p (stmt, BUILT_IN_GOMP_PARALLEL))
 		  called_decl = TREE_OPERAND (gimple_call_arg (stmt, 0), 0);
 		else if (gimple_call_builtin_p (stmt, BUILT_IN_GOACC_PARALLEL))
-		  called_decl = TREE_OPERAND (gimple_call_arg (stmt, 1), 0);
+		  called_decl = TREE_OPERAND (gimple_call_arg (stmt, 2), 0);
 
 		if (called_decl != NULL_TREE
 		    && !fndecl_maybe_in_other_partition (called_decl))
diff --git a/libgomp/Makefile.am b/libgomp/Makefile.am
index 99ad2fd456d..4de30914d3d 100644
--- a/libgomp/Makefile.am
+++ b/libgomp/Makefile.am
@@ -13,9 +13,16 @@ search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir) \
 fincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)$(MULTISUBDIR)/finclude
 libsubincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/include
 
+LIBFFI = @LIBFFI@
+LIBFFIINCS = @LIBFFIINCS@
+
+if USE_LIBFFI
+libgomp_la_LIBADD = $(LIBFFI)
+endif
+
 vpath % $(strip $(search_path))
 
-AM_CPPFLAGS = $(addprefix -I, $(search_path))
+AM_CPPFLAGS = $(addprefix -I, $(search_path)) $(LIBFFIINCS)
 AM_CFLAGS = $(XCFLAGS)
 AM_LDFLAGS = $(XLDFLAGS) $(SECTION_LDFLAGS) $(OPT_LDFLAGS)
 
diff --git a/libgomp/Makefile.in b/libgomp/Makefile.in
index 7a84b5681e1..617615d4d52 100644
--- a/libgomp/Makefile.in
+++ b/libgomp/Makefile.in
@@ -171,7 +171,6 @@ libgomp_plugin_nvptx_la_LINK = $(LIBTOOL) --tag=CC \
 	$(libgomp_plugin_nvptx_la_LDFLAGS) $(LDFLAGS) -o $@
 @PLUGIN_NVPTX_TRUE@am_libgomp_plugin_nvptx_la_rpath = -rpath \
 @PLUGIN_NVPTX_TRUE@	$(toolexeclibdir)
-libgomp_la_LIBADD =
 @USE_FORTRAN_TRUE@am__objects_1 = openacc.lo
 am_libgomp_la_OBJECTS = alloc.lo atomic.lo barrier.lo critical.lo \
 	env.lo error.lo icv.lo icv-device.lo iter.lo iter_ull.lo \
@@ -279,6 +278,8 @@ INSTALL_SCRIPT = @INSTALL_SCRIPT@
 INSTALL_STRIP_PROGRAM = @INSTALL_STRIP_PROGRAM@
 LD = @LD@
 LDFLAGS = @LDFLAGS@
+LIBFFI = @LIBFFI@
+LIBFFIINCS = @LIBFFIINCS@
 LIBOBJS = @LIBOBJS@
 LIBS = @LIBS@
 LIBTOOL = @LIBTOOL@
@@ -410,7 +411,8 @@ search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir) \
 
 fincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)$(MULTISUBDIR)/finclude
 libsubincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/include
-AM_CPPFLAGS = $(addprefix -I, $(search_path))
+libgomp_la_LIBADD = $(LIBFFI)
+AM_CPPFLAGS = $(addprefix -I, $(search_path)) $(LIBFFIINCS)
 AM_CFLAGS = $(XCFLAGS)
 AM_LDFLAGS = $(XLDFLAGS) $(SECTION_LDFLAGS) $(OPT_LDFLAGS)
 toolexeclib_LTLIBRARIES = libgomp.la $(am__append_1) $(am__append_2)
diff --git a/libgomp/config.h.in b/libgomp/config.h.in
index 2f45aa74bbe..65e01c5376a 100644
--- a/libgomp/config.h.in
+++ b/libgomp/config.h.in
@@ -189,5 +189,8 @@
 /* Define to 1 if the target use emutls for thread-local storage. */
 #undef USE_EMUTLS
 
+/* Define to 1 if the target requires libffi to call the offloaded functions. */
+#undef USE_LIBFFI
+
 /* Version number of package */
 #undef VERSION
diff --git a/libgomp/configure b/libgomp/configure
index 11f5b0b1e1c..cc24a81372e 100755
--- a/libgomp/configure
+++ b/libgomp/configure
@@ -649,6 +649,10 @@ PLUGIN_NVPTX
 CUDA_DRIVER_LIB
 CUDA_DRIVER_INCLUDE
 offload_targets
+USE_LIBFFI_FALSE
+USE_LIBFFI_TRUE
+LIBFFIINCS
+LIBFFI
 libtool_VERSION
 ac_ct_FC
 FCFLAGS
@@ -2655,7 +2659,6 @@ else
 fi
 
 
-
 # -------
 # -------
 
@@ -11155,7 +11158,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 11158 "configure"
+#line 11161 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -11261,7 +11264,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 11264 "configure"
+#line 11267 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -15137,6 +15140,28 @@ $as_echo "#define LIBGOMP_OFFLOADED_ONLY 1" >>confdefs.h
 
 fi
 
+# Prepare libffi when necessary.
+
+LIBFFI=
+LIBFFIINCS=
+if test -d ../libffi; then
+
+$as_echo "#define USE_LIBFFI 1" >>confdefs.h
+
+   LIBFFI=../libffi/libffi_convenience.la
+   LIBFFIINCS='-I$(top_srcdir)/../libffi/include -I../libffi/include'
+fi
+
+
+ if test -d ../libffi; then
+  USE_LIBFFI_TRUE=
+  USE_LIBFFI_FALSE='#'
+else
+  USE_LIBFFI_TRUE='#'
+  USE_LIBFFI_FALSE=
+fi
+
+
 # Plugins for offload execution, configure.ac fragment.  -*- mode: autoconf -*-
 #
 # Copyright (C) 2014-2017 Free Software Foundation, Inc.
@@ -16960,6 +16985,10 @@ if test -z "${MAINTAINER_MODE_TRUE}" && test -z "${MAINTAINER_MODE_FALSE}"; then
   as_fn_error "conditional \"MAINTAINER_MODE\" was never defined.
 Usually this means the macro was only invoked conditionally." "$LINENO" 5
 fi
+if test -z "${USE_LIBFFI_TRUE}" && test -z "${USE_LIBFFI_FALSE}"; then
+  as_fn_error "conditional \"USE_LIBFFI\" was never defined.
+Usually this means the macro was only invoked conditionally." "$LINENO" 5
+fi
 if test -z "${PLUGIN_NVPTX_TRUE}" && test -z "${PLUGIN_NVPTX_FALSE}"; then
   as_fn_error "conditional \"PLUGIN_NVPTX\" was never defined.
 Usually this means the macro was only invoked conditionally." "$LINENO" 5
diff --git a/libgomp/configure.ac b/libgomp/configure.ac
index a42d4f08b4b..aa49577537e 100644
--- a/libgomp/configure.ac
+++ b/libgomp/configure.ac
@@ -28,7 +28,6 @@ LIBGOMP_ENABLE(generated-files-in-srcdir, no, ,
 AC_MSG_RESULT($enable_generated_files_in_srcdir)
 AM_CONDITIONAL(GENINSRC, test "$enable_generated_files_in_srcdir" = yes)
 
-
 # -------
 # -------
 
@@ -215,6 +214,19 @@ if test x$libgomp_offloaded_only = xyes; then
             [Define to 1 if building libgomp for an accelerator-only target.])
 fi
 
+# Prepare libffi when necessary.
+
+LIBFFI=
+LIBFFIINCS=
+if test -d ../libffi; then
+   AC_DEFINE(USE_LIBFFI, 1, [Define if we're to use libffi.])
+   LIBFFI=../libffi/libffi_convenience.la
+   LIBFFIINCS='-I$(top_srcdir)/../libffi/include -I../libffi/include'
+fi
+AC_SUBST(LIBFFI)
+AC_SUBST(LIBFFIINCS)
+AM_CONDITIONAL([USE_LIBFFI], [test -d ../libffi])
+
 m4_include([plugin/configfrag.ac])
 
 # Check for functions needed.
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index c025069b457..44097cfd56a 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -119,6 +119,13 @@ extern void GOMP_OFFLOAD_openacc_exec (void (*) (void *), size_t, void **,
 extern void GOMP_OFFLOAD_openacc_async_exec (void (*) (void *), size_t, void **,
 					     void **, unsigned *, void *,
 					     struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_exec_params (void (*) (void *), size_t,
+					      void **, void **, unsigned *,
+					      void *);
+extern void GOMP_OFFLOAD_openacc_async_exec_params (void (*) (void *), size_t,
+						    void **, void **,
+						    unsigned *, void *,
+						    struct goacc_asyncqueue *);
 extern struct goacc_asyncqueue *GOMP_OFFLOAD_openacc_async_construct (void);
 extern bool GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *);
 extern int GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 59e7ca8b8c8..a31c83cc656 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -885,6 +885,7 @@ typedef struct acc_dispatch_t
 
   /* Execute.  */
   __typeof (GOMP_OFFLOAD_openacc_exec) *exec_func;
+  __typeof (GOMP_OFFLOAD_openacc_exec_params) *exec_params_func;
 
   struct {
     gomp_mutex_t lock;
@@ -900,6 +901,7 @@ typedef struct acc_dispatch_t
     __typeof (GOMP_OFFLOAD_openacc_async_queue_callback) *queue_callback_func;
 
     __typeof (GOMP_OFFLOAD_openacc_async_exec) *exec_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_exec_params) *exec_params_func;
     __typeof (GOMP_OFFLOAD_openacc_async_host2dev) *host2dev_func;
     __typeof (GOMP_OFFLOAD_openacc_async_dev2host) *dev2host_func;
   } async;
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 546ac929a0e..7a49acc1dfe 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -461,8 +461,10 @@ GOACC_2.0.1 {
 GOACC_2.0.GOMP_4_BRANCH {
   global:
 	GOMP_set_offload_targets;
+	GOACC_parallel_keyed_v2;
 } GOACC_2.0.1;
 
+
 GOMP_PLUGIN_1.0 {
   global:
 	GOMP_PLUGIN_malloc;
diff --git a/libgomp/libgomp_g.h b/libgomp/libgomp_g.h
index 958ca6e9cc3..c40e67f2e80 100644
--- a/libgomp/libgomp_g.h
+++ b/libgomp/libgomp_g.h
@@ -298,6 +298,8 @@ extern void GOMP_teams (unsigned int, unsigned int);
 
 extern void GOACC_parallel_keyed (int, void (*) (void *), size_t,
 				  void **, size_t *, unsigned short *, ...);
+extern void GOACC_parallel_keyed_v2 (int, int, void (*) (void *), size_t,
+				     void **, size_t *, unsigned short *, ...);
 extern void GOACC_parallel (int, void (*) (void *), size_t, void **, size_t *,
 			    unsigned short *, int, int, int, int, int, ...);
 extern void GOACC_data_start (int, size_t, void **, size_t *,
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 3b2cafb2c55..5b4e34d7190 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -158,6 +158,30 @@ host_openacc_async_exec (void (*fn) (void *),
   fn (hostaddrs);
 }
 
+static void
+host_openacc_exec_params (void (*fn) (void *),
+			  size_t mapnum __attribute__ ((unused)),
+			  void **hostaddrs,
+			  void **devaddrs __attribute__ ((unused)),
+			  unsigned *dims __attribute__ ((unused)),
+			  void *targ_mem_desc __attribute__ ((unused)))
+{
+  fn (hostaddrs);
+}
+
+static void
+host_openacc_async_exec_params (void (*fn) (void *),
+				size_t mapnum __attribute__ ((unused)),
+				void **hostaddrs,
+				void **devaddrs __attribute__ ((unused)),
+				unsigned *dims __attribute__ ((unused)),
+				void *targ_mem_desc __attribute__ ((unused)),
+				struct goacc_asyncqueue *aq __attribute__ ((unused)))
+{
+  fn (hostaddrs);
+}
+
+
 static int
 host_openacc_async_test (struct goacc_asyncqueue *aq __attribute__ ((unused)))
 {
@@ -265,6 +289,7 @@ static struct gomp_device_descr host_dispatch =
       .data_environ = NULL,
 
       .exec_func = host_openacc_exec,
+      .exec_params_func = host_openacc_exec_params,
 
       .async = {
 	.construct_func = host_openacc_async_construct,
@@ -274,6 +299,7 @@ static struct gomp_device_descr host_dispatch =
 	.serialize_func = host_openacc_async_serialize,
 	.queue_callback_func = host_openacc_async_queue_callback,
 	.exec_func = host_openacc_async_exec,
+	.exec_params_func = host_openacc_async_exec_params,
 	.dev2host_func = host_openacc_async_dev2host,
 	.host2dev_func = host_openacc_async_host2dev,
       },
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 1172d739ec7..3c5aa24b5f5 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -31,6 +31,9 @@
 #include "libgomp_g.h"
 #include "gomp-constants.h"
 #include "oacc-int.h"
+#if USE_LIBFFI
+# include "ffi.h"
+#endif
 #ifdef HAVE_INTTYPES_H
 # include <inttypes.h>  /* For PRIu64.  */
 #endif
@@ -104,19 +107,47 @@ handle_ftn_pointers (size_t mapnum, void **hostaddrs, size_t *sizes,
 
 static void goacc_wait (int async, int num_waits, va_list *ap);
 
+static void
+goacc_call_host_fn (void (*fn) (void *), size_t mapnum, void **hostaddrs,
+		    int params)
+{
+#ifdef USE_LIBFFI
+  ffi_cif cif;
+  ffi_type *arg_types[mapnum];
+  void *arg_values[mapnum];
+  ffi_arg result;
+  int i;
+
+  if (params)
+    {
+      for (i = 0; i < mapnum; i++)
+	{
+	  arg_types[i] = &ffi_type_pointer;
+	  arg_values[i] = &hostaddrs[i];
+	}
+
+      if (ffi_prep_cif (&cif, FFI_DEFAULT_ABI, mapnum,
+			&ffi_type_void, arg_types) == FFI_OK)
+	ffi_call (&cif, FFI_FN (fn), &result, arg_values);
+      else
+	abort ();
+    }
+  else
+#endif
+  fn (hostaddrs);
+}
 
 /* Launch a possibly offloaded function on DEVICE.  FN is the host fn
    address.  MAPNUM, HOSTADDRS, SIZES & KINDS  describe the memory
    blocks to be copied to/from the device.  Varadic arguments are
    keyed optional parameters terminated with a zero.  */
 
-void
-GOACC_parallel_keyed (int device, void (*fn) (void *),
-		      size_t mapnum, void **hostaddrs, size_t *sizes,
-		      unsigned short *kinds, ...)
+static void
+GOACC_parallel_keyed_internal (int device, int params, void (*fn) (void *),
+			       size_t mapnum, void **hostaddrs, size_t *sizes,
+			       unsigned short *kinds, va_list *ap)
 {
   bool host_fallback = device == GOMP_DEVICE_HOST_FALLBACK;
-  va_list ap;
   struct goacc_thread *thr;
   struct gomp_device_descr *acc_dev;
   struct target_mem_desc *tgt;
@@ -206,13 +237,13 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
       prof_info.device_type = acc_device_host;
       api_info.device_type = prof_info.device_type;
       goacc_save_and_set_bind (acc_device_host);
-      fn (hostaddrs);
+      goacc_call_host_fn (fn, mapnum, hostaddrs, params);
       goacc_restore_bind ();
       goto out;
     }
   else if (acc_device_type (acc_dev->type) == acc_device_host)
     {
-      fn (hostaddrs);
+      goacc_call_host_fn (fn, mapnum, hostaddrs, params);
       goto out;
     }
   else if (profiling_dispatch_p)
@@ -222,9 +253,8 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
   for (i = 0; i != GOMP_DIM_MAX; i++)
     dims[i] = 0;
 
-  va_start (ap, kinds);
   /* TODO: This will need amending when device_type is implemented.  */
-  while ((tag = va_arg (ap, unsigned)) != 0)
+  while ((tag = va_arg (*ap, unsigned)) != 0)
     {
       if (GOMP_LAUNCH_DEVICE (tag))
 	gomp_fatal ("device_type '%d' offload parameters, libgomp is too old",
@@ -238,7 +268,7 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
 
 	    for (i = 0; i != GOMP_DIM_MAX; i++)
 	      if (mask & GOMP_DIM_MASK (i))
-		dims[i] = va_arg (ap, unsigned);
+		dims[i] = va_arg (*ap, unsigned);
 	  }
 	  break;
 
@@ -248,7 +278,7 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
 	    async = GOMP_LAUNCH_OP (tag);
 
 	    if (async == GOMP_LAUNCH_OP_MAX)
-	      async = va_arg (ap, unsigned);
+	      async = va_arg (*ap, unsigned);
 
 	    if (profiling_dispatch_p)
 	      {
@@ -267,7 +297,7 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
 	    int num_waits = ((signed short) GOMP_LAUNCH_OP (tag));
 
 	    if (num_waits > 0)
-	      goacc_wait (async, num_waits, &ap);
+	      goacc_wait (async, num_waits, ap);
 	    else if (num_waits == acc_async_noval)
 	      acc_wait_all_async (async);
 	    break;
@@ -278,7 +308,6 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
 		      " libgomp is too old", GOMP_LAUNCH_CODE (tag));
 	}
     }
-  va_end (ap);
   
   if (!(acc_dev->capabilities & GOMP_OFFLOAD_CAP_NATIVE_EXEC))
     {
@@ -338,8 +367,12 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
 
   if (aq == NULL)
     {
-      acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
-				  dims, tgt);
+      if (params)
+	acc_dev->openacc.exec_params_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+					   dims, tgt);
+      else
+	acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+				    dims, tgt);
       if (profiling_dispatch_p)
 	{
 	  prof_info.event_type = acc_ev_exit_data_start;
@@ -362,8 +395,12 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
     }
   else
     {
-      acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
-					dims, tgt, aq);
+      if (params)
+	acc_dev->openacc.async.exec_params_func (tgt_fn, mapnum, hostaddrs,
+						 devaddrs, dims, tgt, aq);
+      else
+	acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs,
+					  devaddrs, dims, tgt, aq);
       goacc_async_copyout_unmap_vars (tgt, aq);
     }
 
@@ -381,6 +418,30 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
     }
 }
 
+void
+GOACC_parallel_keyed (int device, void (*fn) (void *),
+		      size_t mapnum, void **hostaddrs, size_t *sizes,
+		      unsigned short *kinds, ...)
+{
+  va_list ap;
+  va_start (ap, kinds);
+  GOACC_parallel_keyed_internal (device, 0, fn, mapnum, hostaddrs, sizes,
+				 kinds, &ap);
+  va_end (ap);
+}
+
+void
+GOACC_parallel_keyed_v2 (int device, int args, void (*fn) (void *),
+			 size_t mapnum, void **hostaddrs, size_t *sizes,
+			 unsigned short *kinds, ...)
+{
+  va_list ap;
+  va_start (ap, kinds);
+  GOACC_parallel_keyed_internal (device, args, fn, mapnum, hostaddrs, sizes,
+				 kinds, &ap);
+  va_end (ap);
+}
+
 /* Legacy entry point, only provide host execution.  */
 
 void
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 94abfe2036f..bdc0c30e1f5 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -697,12 +697,11 @@ link_ptx (CUmodule *module, const struct targ_ptx_obj *ptx_objs,
 static void
 nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 	    unsigned *dims, void *targ_mem_desc,
-	    CUdeviceptr dp, CUstream stream)
+	    void **kargs, CUstream stream)
 {
   struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
   CUfunction function;
   int i;
-  void *kargs[1];
   int cpu_size = nvptx_thread ()->ptx_dev->max_threads_per_multiprocessor;
   int block_size = nvptx_thread ()->ptx_dev->max_threads_per_block;
   int dev_size = nvptx_thread ()->ptx_dev->num_sms;
@@ -888,7 +887,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 					    api_info);
     }
   
-  kargs[0] = &dp;
   CUDA_CALL_ASSERT (cuLaunchKernel, function,
 		    dims[GOMP_DIM_GANG], 1, 1,
 		    dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
@@ -1293,22 +1291,29 @@ GOMP_OFFLOAD_free (int ord, void *ptr)
 	  && nvptx_free (ptr, ptx_devices[ord]));
 }
 
-void
-GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
-			   void **hostaddrs, void **devaddrs,
-			   unsigned *dims, void *targ_mem_desc)
+static void
+openacc_exec_internal (void (*fn) (void *), int params, size_t mapnum,
+		       void **hostaddrs, void **devaddrs,
+		       unsigned *dims, void *targ_mem_desc)
 {
   GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
-  void **hp = NULL;
+  void **hp = alloca (mapnum * sizeof (void *));
   CUdeviceptr dp = 0;
 
   if (mapnum > 0)
     {
-      hp = alloca (mapnum * sizeof (void *));
-      for (int i = 0; i < mapnum; i++)
-	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-      CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+      if (params)
+	{
+	  for (int i = 0; i < mapnum; i++)
+	    hp[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
+	}
+      else
+	{
+	  for (int i = 0; i < mapnum; i++)
+	    hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
+	  CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+	}
     }
 
   /* Copy the (device) pointers to arguments to the device (dp and hp might in
@@ -1333,7 +1338,8 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
       data_event_info.data_event.var_name = NULL; //TODO
       data_event_info.data_event.bytes = mapnum * sizeof (void *);
       data_event_info.data_event.host_ptr = hp;
-      data_event_info.data_event.device_ptr = (void *) dp;
+      if (!params)
+	data_event_info.data_event.device_ptr = (void *) dp;
 
       api_info->device_api = acc_device_api_cuda;
 
@@ -1341,7 +1347,7 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
 					    api_info);
     }
 
-  if (mapnum > 0)
+  if (!params && mapnum > 0)
     CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp,
 		      mapnum * sizeof (void *));
 
@@ -1353,8 +1359,15 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
 					    api_info);
     }
 
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, NULL);
+  if (params)
+    nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+		hp, NULL);
+  else
+    {
+      void *kargs[1] = { &dp };
+      nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+		  kargs, NULL);
+    }
 
   CUresult r = cuStreamSynchronize (NULL);
   const char *maybe_abort_msg = "(perhaps abort was called)";
@@ -1363,7 +1376,27 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
 		       maybe_abort_msg);
   else if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
-  CUDA_CALL_ASSERT (cuMemFree, dp);
+
+  if (!params)
+    CUDA_CALL_ASSERT (cuMemFree, dp);
+}
+
+void
+GOMP_OFFLOAD_openacc_exec_params (void (*fn) (void *), size_t mapnum,
+				  void **hostaddrs, void **devaddrs,
+				  unsigned *dims, void *targ_mem_desc)
+{
+  openacc_exec_internal (fn, 1, mapnum, hostaddrs, devaddrs, dims,
+			 targ_mem_desc);
+}
+
+void
+GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
+			   void **hostaddrs, void **devaddrs,
+			   unsigned *dims, void *targ_mem_desc)
+{
+  openacc_exec_internal (fn, 0, mapnum, hostaddrs, devaddrs, dims,
+			 targ_mem_desc);
 }
 
 static void
@@ -1374,11 +1407,11 @@ cuda_free_argmem (void *ptr)
   free (block);
 }
 
-void
-GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
-				 void **hostaddrs, void **devaddrs,
-				 unsigned *dims, void *targ_mem_desc,
-				 struct goacc_asyncqueue *aq)
+static void
+openacc_async_exec_internal (void (*fn) (void *), int params, size_t mapnum,
+			     void **hostaddrs, void **devaddrs,
+			     unsigned *dims, void *targ_mem_desc,
+			     struct goacc_asyncqueue *aq)
 {
   GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
@@ -1388,11 +1421,20 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
 
   if (mapnum > 0)
     {
-      block = (void **) GOMP_PLUGIN_malloc ((mapnum + 2) * sizeof (void *));
-      hp = block + 2;
-      for (int i = 0; i < mapnum; i++)
-	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-      CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+      if (params)
+	{
+	  hp = alloca (sizeof (void *) * mapnum);
+	  for (int i = 0; i < mapnum; i++)
+	    hp[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
+	}
+      else
+	{
+	  block = (void **) GOMP_PLUGIN_malloc ((mapnum + 2) * sizeof (void *));
+	  hp = block + 2;
+	  for (int i = 0; i < mapnum; i++)
+	    hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
+	  CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+	}
     }
 
   /* Copy the (device) pointers to arguments to the device (dp and hp might in
@@ -1417,7 +1459,8 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
       data_event_info.data_event.var_name = NULL; //TODO
       data_event_info.data_event.bytes = mapnum * sizeof (void *);
       data_event_info.data_event.host_ptr = hp;
-      data_event_info.data_event.device_ptr = (void *) dp;
+      if (!params)
+	data_event_info.data_event.device_ptr = (void *) dp;
 
       api_info->device_api = acc_device_api_cuda;
 
@@ -1425,7 +1468,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
 					    api_info);
     }
 
-  if (mapnum > 0)
+  if (!params && mapnum > 0)
     {
       CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp,
 			mapnum * sizeof (void *), aq->cuda_stream);
@@ -1443,14 +1486,42 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
       GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
 					    api_info);
     }
-  
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, aq->cuda_stream);
 
-  if (mapnum > 0)
+  if (params)
+    nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+		hp, aq->cuda_stream);
+  else
+    {
+      void *kargs[1] = { &dp };
+      nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+		  kargs, aq->cuda_stream);
+    }
+
+  if (!params && mapnum > 0)
     GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
 }
 
+void
+GOMP_OFFLOAD_openacc_async_exec_params (void (*fn) (void *), size_t mapnum,
+					void **hostaddrs, void **devaddrs,
+					unsigned *dims, void *targ_mem_desc,
+					struct goacc_asyncqueue *aq)
+{
+  openacc_async_exec_internal (fn, 1, mapnum, hostaddrs, devaddrs, dims,
+			       targ_mem_desc, aq);
+}
+
+void
+GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
+				 void **hostaddrs, void **devaddrs,
+				 unsigned *dims, void *targ_mem_desc,
+				 struct goacc_asyncqueue *aq)
+{
+  openacc_async_exec_internal (fn, 0, mapnum, hostaddrs, devaddrs, dims,
+			       targ_mem_desc, aq);
+}
+
+
 void *
 GOMP_OFFLOAD_openacc_create_thread_data (int ord)
 {
diff --git a/libgomp/target.c b/libgomp/target.c
index 336581d2196..10c5e34f378 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2908,6 +2908,7 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
     {
       if (!DLSYM_OPT (openacc.exec, openacc_exec)
+	  || !DLSYM_OPT (openacc.exec_params, openacc_exec_params)
 	  || !DLSYM_OPT (openacc.create_thread_data,
 			 openacc_create_thread_data)
 	  || !DLSYM_OPT (openacc.destroy_thread_data,
@@ -2920,6 +2921,7 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
 	  || !DLSYM_OPT (openacc.async.queue_callback,
 			 openacc_async_queue_callback)
 	  || !DLSYM_OPT (openacc.async.exec, openacc_async_exec)
+	  || !DLSYM_OPT (openacc.async.exec_params, openacc_async_exec_params)
 	  || !DLSYM_OPT (openacc.async.dev2host, openacc_async_dev2host)
 	  || !DLSYM_OPT (openacc.async.host2dev, openacc_async_host2dev))
 	{
diff --git a/libgomp/testsuite/Makefile.in b/libgomp/testsuite/Makefile.in
index 6edb7ae7ade..4d7f43abe3d 100644
--- a/libgomp/testsuite/Makefile.in
+++ b/libgomp/testsuite/Makefile.in
@@ -120,6 +120,8 @@ INSTALL_SCRIPT = @INSTALL_SCRIPT@
 INSTALL_STRIP_PROGRAM = @INSTALL_STRIP_PROGRAM@
 LD = @LD@
 LDFLAGS = @LDFLAGS@
+LIBFFI = @LIBFFI@
+LIBFFIINCS = @LIBFFIINCS@
 LIBOBJS = @LIBOBJS@
 LIBS = @LIBS@
 LIBTOOL = @LIBTOOL@
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
index dad6d13eb60..c6abc1d724a 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
@@ -1,6 +1,11 @@
 /* This test exercises combined directives.  */
 
+/* This test falls back to host execution because struct alias
+   analysis is deactivated on OpenACC parallel regions.  Consequently,
+   parloops can no longer disambiguate arrays a and b.  */
+
 /* { dg-do run } */
+/* { dg-xfail-if "n/a" { openacc_nvidia_accel_selected } { "-O2" } { "" } } */
 
 #include <stdlib.h>
 
