On 12/18/2017 02:58 PM, Cesar Philippidis wrote:
> Jakub,
>
> I'd like your thoughts on the following problem.
>
> One of the offloading bottlenecks with GPU acceleration in OpenACC is
> the nontrivial offloaded function invocation overhead. At present, GCC
> generates code to pass a struct containing one field for each of the
> data mappings used in the OMP child function. I'm guessing a struct is
> used because pthread_create only accepts a single argument for new
> threads. What I'd like to do is to create the child function with one
> argument per data mapping. This has a number of advantages:
>
> 1. No device memory needs to be managed for the child function data
>    mapping struct.
>
> 2. On PTX targets, the .param address space is cached. Using
>    individual parameters for function arguments will allow the nvptx
>    back end to generate a more relaxed "execution model" because the
>    thread initialization code will be accessing cached memory instead
>    of global memory.
>
> 3. It was my hope that this would set a path to eliminate the
>    GOMP_MAP_FIRSTPRIVATE_INT optimization, by replacing those mappings
>    with the actual value directly.
>
> 1) is huge for programs, such as cloverleaf, which launch a lot of small
> parallel regions a lot of times.
>
> For the execution model in 2), OpenACC begins each parallel region in a
> gang-redundant, worker-single and vector-single state. To transition
> from a single-threaded (or single vector lane) state to a multi-threaded
> partitioned state, GCC needs to emit code to propagate live variables,
> both on the stack and in registers, to the spawned threads. A lot of
> loops, including DGEMV from BLAS, can be executed in a fully-redundant
> state. Executing code redundantly has the advantage of not requiring any
> state transition code. The problem here is twofold: a) the struct is in
> global memory, and b) not all of the GPU threads are executing the same
> instruction at the same time. Consequently, initializing each thread in
> a fully redundant manner actually hurts performance. When I rewrote the
> same test case passing the data mappings via individual parameters, that
> optimization improved performance compared to GCC trunk's baseline.
>
> Lastly, 3) is more of a simplification than anything else. I'm not too
> concerned about this because those variables only get initialized once.
> So long as they don't require a separate COPYIN data mapping, the
> performance hit should be negligible.
>
> In this first attempt at using parameters I taught lower_omp_target how
> to create child functions for OpenACC parallel regions with individual
> parameters for the data mappings instead of using a large struct. This
> works for the most part, but I realized too late that pthread_create
> only passes one argument to each thread it creates. It should be noted
> that I left the kernels implementation as-is, using the global struct
> argument, because the kernels implementation in GCC is largely
> ineffective and usually falls back to executing code on the host CPU.
> Eventually, we want to redo kernels, but not until we get the parallel
> code running efficiently.
>
> For fallback host targets, libgomp is using libffi to pass arguments to
> the offloaded functions. This works OK at the moment because the host
> code is always single-threaded. Ideally, that would change in the
> future, but I'm not aware of any immediate plans to do so.
>
> Question: is this approach acceptable for Stage 1 in May, or should I
> make the offloaded function parameter expansion target-specific? I can
> think of a couple of ways to make this target-specific:
>
> a. Create two child functions during lowering, one with individual
>    parameters for the data mappings, and another which takes in a
>    single struct. The latter then calls the former immediately on
>    entry.
>
> b. Teach oaccdevlow to expand the incoming struct into individual
>    parameters.
>
> I'm concerned that b) is going to be a large pass. The SRA pass is
> somewhat large at 5k. While this should be simpler, I'm not sure by how
> much (probably a lot because it won't need to perform as much analysis).
>
> While this patch is functional, it's not complete. I still need to tweak
> a couple of things in the runtime. But I don't want to spend too much
> time on it if we decide to go with a different approach.
>
> Any thoughts are welcome.
>
> By the way, next we'll be working on increasing vector_length on nvptx
> targets. In conjunction with that, we'll be simplifying the OpenACC
> execution model in the nvptx BE, along with adding a new reduction
> finalizer.
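
For readers following along, option a) from the quoted message can be pictured with a small C sketch. This is only an illustration under assumptions: the mapping names (x, y, a, n), the layout of struct omp_data_s, and both function names are hypothetical and are not what GCC actually emits.

```c
#include <stddef.h>

/* Hypothetical data mappings for a region computing y[i] += a * x[i].
   One field per mapping, as in the struct-based scheme.  */
struct omp_data_s
{
  float *x;
  float *y;
  float *a;
  size_t *n;
};

/* Child function with one parameter per data mapping.  */
static void
child_fn_params (float *x, float *y, float *a, size_t *n)
{
  for (size_t i = 0; i < *n; i++)
    y[i] += *a * x[i];
}

/* Struct-taking child function.  It does nothing except unpack the
   struct and call the parameter-taking variant immediately on entry,
   so it still fits a one-argument interface such as pthread_create.  */
static void
child_fn_struct (void *data)
{
  struct omp_data_s *d = (struct omp_data_s *) data;
  child_fn_params (d->x, d->y, d->a, d->n);
}
```

In this sketch a one-argument launcher would call child_fn_struct, while a target that can pass parameters directly would call child_fn_params and skip the struct entirely.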
After thinking about this some more, I decided that it would be better
to expand the offloaded function arguments into individual parameters
during omp lowering, rather than writing a separate pass later on. I
don't see too many disadvantages to using libffi after a pthread is
spawned by the host. If anything, the pthread's use of libffi is
equivalent to the accelerator performing SRA anyway.

I've committed this patch to openacc-gcc-7-branch. Note that I had to
xfail libgomp.oacc-c-c++-common/combined-directives-1.c because I
disabled struct analysis on parallel regions. Unfortunately, that makes
kernels slightly less effective. But more often than not, kernels
regions fall back to host execution anyway.

Cesar
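
As a rough picture of what "expanding" means on the host side, here is a minimal C sketch of a thread entry point that receives a single pointer and turns it into individual arguments. It is not the libgomp implementation: libffi handles arbitrary argument counts, whereas this sketch swaps in a plain switch over small counts so that it needs nothing beyond the C standard library. The names launch_desc, call_with_params, thread_entry and scale_child are all hypothetical.

```c
#include <stddef.h>

/* Untyped function pointer used to carry any child function.  */
typedef void (*generic_fn) (void);

/* Hypothetical launch descriptor: what a host thread might receive
   through its single void * argument.  */
struct launch_desc
{
  generic_fn fn;  /* Child function taking ARGC void * parameters.  */
  size_t argc;    /* Number of data mappings.  */
  void **argv;    /* One pointer per data mapping.  */
};

/* Expand ARGV into individual arguments and call FN.  Returns 0 on
   success, or -1 if the arity is not covered by this sketch.  */
static int
call_with_params (const struct launch_desc *d)
{
  void **a = d->argv;

  switch (d->argc)
    {
    case 0:
      d->fn ();
      return 0;
    case 1:
      ((void (*) (void *)) d->fn) (a[0]);
      return 0;
    case 2:
      ((void (*) (void *, void *)) d->fn) (a[0], a[1]);
      return 0;
    case 3:
      ((void (*) (void *, void *, void *)) d->fn) (a[0], a[1], a[2]);
      return 0;
    default:
      return -1;
    }
}

/* pthread-style entry point: one void * in, individual parameters out.  */
static void *
thread_entry (void *arg)
{
  call_with_params ((const struct launch_desc *) arg);
  return NULL;
}

/* Example child function with three pointer parameters:
   *out = *in * *factor.  */
static void
scale_child (void *out, void *in, void *factor)
{
  *(int *) out = *(const int *) in * *(const int *) factor;
}
```

The child function takes only void * parameters here, which mirrors the patch building the OpenACC child function type from ptr_type_node arguments; it also keeps the function pointer casts well defined, since each call goes through the function's real type.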
2017-12-21  Cesar Philippidis  <ce...@codesourcery.com>

	* Makefile.def: Make libgomp depend on libffi.
	* configure.ac: Likewise.
	* Makefile.in: Regenerate.
	* configure: Regenerate.

	gcc/fortran/
	* types.def (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR): Define.

	gcc/
	* builtin-types.def (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR):
	Define.
	* config/nvptx/nvptx.c (nvptx_expand_cmp_swap): Handle PARM_DECLs.
	* omp-builtins.def (BUILT_IN_GOACC_PARALLEL): Call
	GOACC_parallel_keyed_v2.
	* omp-expand.c (expand_omp_target): Update call to
	BUILT_IN_GOACC_PARALLEL.
	* omp-low.c (struct omp_context): Add parm_map member.
	(lookup_parm): New function.
	(build_receiver_ref): Lookup parm_map decls.
	(install_parm_decl): New function.
	(install_var_field): Install parm_map decl for OpenACC parallel
	region data clauses.
	(delete_omp_context): Clean parm_map.
	(scan_sharing_clauses): Install subarray variable mapping into
	parm_map.
	(create_omp_child_function): Defer creation of child function for
	OpenACC parallel regions.
	(scan_omp_target): Likewise.
	(append_decl_arg): New function.
	(lower_omp_target): Create a child offloaded function using one
	parameter per data mapping for OpenACC parallel regions.
	* tree-ssa-structalias.c (find_func_aliases_for_builtin_call): Ignore
	OpenACC parallel regions.
	(find_func_clobbers): Likewise.
	(ipa_pta_execute): Likewise.

	libgomp/
	* Makefile.am: Add libffi build dependency.
	* configure.ac: Likewise.
	* Makefile.in: Regenerate.
	* config.h.in: Regenerate.
	* configure: Regenerate.
	* libgomp-plugin.h: Define GOMP_OFFLOAD_openacc_exec_params and
	GOMP_OFFLOAD_openacc_async_exec_params.
	* libgomp.h (acc_dispatch_t): Use them here.
	* libgomp.map (GOACC_parallel_keyed_v2): Declare.
	* libgomp_g.h (GOACC_parallel_keyed_v2): Likewise.
	* oacc-host.c (host_openacc_exec_params): New function.
	(host_openacc_async_exec_params): Likewise.
	* oacc-parallel.c (goacc_call_host_fn): Likewise.
	(GOACC_parallel_keyed_internal): Likewise.
	(GOACC_parallel_keyed): Wrapper for GOACC_parallel_keyed_internal.
	(GOACC_parallel_keyed_v2): Likewise.
	* plugin/plugin-nvptx.c (nvptx_exec): Replace CUDeviceptr dp parameter
	with void **kargs.
	(openacc_exec_internal): New function.
	(GOMP_OFFLOAD_openacc_exec_params): New function.
	(GOMP_OFFLOAD_openacc_exec): Update to call openacc_exec_internal.
	(openacc_async_exec_internal): New function.
	(GOMP_OFFLOAD_openacc_async_exec_params): New function.
	(GOMP_OFFLOAD_openacc_async_exec): Update call to
	openacc_async_exec_internal.
	* target.c (gomp_load_plugin_for_device): Handle openacc_exec_params
	and openacc_async_exec_params.
	* testsuite/Makefile.in: Regenerate.
	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c: Xfail
	on offloaded targets.


diff --git a/Makefile.def b/Makefile.def
index abfa9efe959..5e94062fa75 100644
--- a/Makefile.def
+++ b/Makefile.def
@@ -550,6 +550,7 @@ dependencies = { module=configure-target-libgo; on=all-target-libstdc++-v3; };
 dependencies = { module=all-target-libgo; on=all-target-libbacktrace; };
 dependencies = { module=all-target-libgo; on=all-target-libffi; };
 dependencies = { module=all-target-libgo; on=all-target-libatomic; };
+dependencies = { module=configure-target-libgomp; on=configure-target-libffi; };
 dependencies = { module=configure-target-libstdc++-v3; on=configure-target-libgomp; };
 dependencies = { module=configure-target-liboffloadmic; on=configure-target-libgomp; };
 dependencies = { module=configure-target-libsanitizer; on=all-target-libstdc++-v3; };
@@ -564,6 +565,7 @@ dependencies = { module=install-target-libgo; on=install-target-libatomic; };
 dependencies = { module=install-target-libgfortran; on=install-target-libquadmath; };
 dependencies = { module=install-target-libgfortran; on=install-target-libgcc; };
 dependencies = { module=install-target-libsanitizer; on=install-target-libstdc++-v3; };
+dependencies = { module=install-target-libgomp; on=install-target-libffi; };
 dependencies = { module=install-target-libsanitizer; on=install-target-libgcc; };
 dependencies = { module=install-target-libvtv; on=install-target-libstdc++-v3; };
 dependencies = { module=install-target-libvtv; on=install-target-libgcc; };
diff --git a/Makefile.in b/Makefile.in
index b824e0a0ca1..9b4497e3943 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -55803,6 +55803,7 @@ configure-target-libgo: maybe-all-target-libstdc++-v3
 all-target-libgo: maybe-all-target-libbacktrace
 all-target-libgo: maybe-all-target-libffi
 all-target-libgo: maybe-all-target-libatomic
+configure-target-libgomp: maybe-configure-target-libffi
 configure-target-libstdc++-v3: maybe-configure-target-libgomp
 
 configure-stage1-target-libstdc++-v3: maybe-configure-stage1-target-libgomp
@@ -55849,6 +55850,7 @@ install-target-libgo: maybe-install-target-libatomic
 install-target-libgfortran: maybe-install-target-libquadmath
 install-target-libgfortran: maybe-install-target-libgcc
 install-target-libsanitizer: maybe-install-target-libstdc++-v3
+install-target-libgomp: maybe-install-target-libffi
 install-target-libsanitizer: maybe-install-target-libgcc
 install-target-libvtv: maybe-install-target-libstdc++-v3
 install-target-libvtv: maybe-install-target-libgcc
diff --git a/configure b/configure
index 32a38633ad8..ed47944d8f9 100755
--- a/configure
+++ b/configure
@@ -3472,11 +3472,19 @@ case "${target}" in
   ft32-*-*)
     noconfigdirs="$noconfigdirs target-libffi"
     ;;
+  nvptx-*-*)
+    noconfigdirs="$noconfigdirs target-libffi"
+    ;;
   *-*-lynxos*)
     noconfigdirs="$noconfigdirs target-libffi"
     ;;
 esac
 
+libgomp_deps="target-libffi"
+if echo " ${noconfigdirs} " | grep " target-libffi " > /dev/null 2>&1 ; then
+  libgomp_deps=""
+fi
+
 # Disable the go frontend on systems where it is known to not work. Please keep
 # this in sync with contrib/config-list.mk.
 case "${target}" in
@@ -6460,6 +6468,15 @@ esac
 # $build_configdirs and $target_configdirs.
 # If we have the source for $noconfigdirs entries, add them to $notsupp.
+# libgomp depends on libffi. Remove it from nonsupp if necessary. +if ! (echo " $noconfigdirs " | grep " target-libgomp " >/dev/null 2>&1); then + if echo " $noconfigdirs " | grep " target-libffi " >/dev/null 2>&1; then + if test "x${libgomp_deps}" != x; then + noconfigdirs=`echo " $noconfigdirs " | sed -e "s/ target-libffi / /"` + fi + fi +fi + notsupp="" for dir in . $skipdirs $noconfigdirs ; do dirname=`echo $dir | sed -e s/target-//g -e s/build-//g` diff --git a/configure.ac b/configure.ac index 12377499295..a3b9e116a05 100644 --- a/configure.ac +++ b/configure.ac @@ -800,11 +800,19 @@ case "${target}" in ft32-*-*) noconfigdirs="$noconfigdirs target-libffi" ;; + nvptx-*-*) + noconfigdirs="$noconfigdirs target-libffi" + ;; *-*-lynxos*) noconfigdirs="$noconfigdirs target-libffi" ;; esac +libgomp_deps="target-libffi" +if echo " ${noconfigdirs} " | grep " target-libffi " > /dev/null 2>&1 ; then + libgomp_deps="" +fi + # Disable the go frontend on systems where it is known to not work. Please keep # this in sync with contrib/config-list.mk. case "${target}" in @@ -2127,6 +2135,15 @@ esac # $build_configdirs and $target_configdirs. # If we have the source for $noconfigdirs entries, add them to $notsupp. +# libgomp depends on libffi. Remove it from nonsupp if necessary. +if ! (echo " $noconfigdirs " | grep " target-libgomp " >/dev/null 2>&1); then + if echo " $noconfigdirs " | grep " target-libffi " >/dev/null 2>&1; then + if test "x${libgomp_deps}" != x; then + noconfigdirs=`echo " $noconfigdirs " | sed -e "s/ target-libffi / /"` + fi + fi +fi + notsupp="" for dir in . 
$skipdirs $noconfigdirs ; do dirname=`echo $dir | sed -e s/target-//g -e s/build-//g` diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def index ac9894467ec..7f647c65162 100644 --- a/gcc/builtin-types.def +++ b/gcc/builtin-types.def @@ -763,6 +763,10 @@ DEF_FUNCTION_TYPE_VAR_6 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR, BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE, BT_PTR, BT_PTR, BT_PTR) +DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR, + BT_VOID, BT_INT, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE, + BT_PTR, BT_PTR, BT_PTR) + DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR, BT_VOID, BT_INT, BT_SIZE, BT_PTR, BT_PTR, BT_PTR, BT_INT, BT_INT) diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index a7b4c09bf6c..55c7e3cbf90 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -4737,6 +4737,10 @@ nvptx_expand_cmp_swap (tree exp, rtx target, NULL_RTX, mode, EXPAND_NORMAL); rtx pat; + /* 'mem' might be a PARM_DECL. If so, convert it to a register. 
*/ + if (!REG_P (mem)) + mem = copy_to_mode_reg (GET_MODE (mem), mem); + mem = gen_rtx_MEM (mode, mem); if (!REG_P (cmp)) cmp = copy_to_mode_reg (mode, cmp); diff --git a/gcc/fortran/types.def b/gcc/fortran/types.def index 1f8a5a1277c..3c3ad69d848 100644 --- a/gcc/fortran/types.def +++ b/gcc/fortran/types.def @@ -252,3 +252,7 @@ DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR, DEF_FUNCTION_TYPE_VAR_6 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR, BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE, BT_PTR, BT_PTR, BT_PTR) + +DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR, + BT_VOID, BT_INT, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE, + BT_PTR, BT_PTR, BT_PTR) diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def index 69b73f4b8c4..a9ec667aa54 100644 --- a/gcc/omp-builtins.def +++ b/gcc/omp-builtins.def @@ -38,8 +38,8 @@ DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DATA_END, "GOACC_data_end", DEF_GOACC_BUILTIN (BUILT_IN_GOACC_ENTER_EXIT_DATA, "GOACC_enter_exit_data", BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR, ATTR_NOTHROW_LIST) -DEF_GOACC_BUILTIN (BUILT_IN_GOACC_PARALLEL, "GOACC_parallel_keyed", - BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR, +DEF_GOACC_BUILTIN (BUILT_IN_GOACC_PARALLEL, "GOACC_parallel_keyed_v2", + BT_FN_VOID_INT_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR, ATTR_NOTHROW_LIST) DEF_GOACC_BUILTIN (BUILT_IN_GOACC_UPDATE, "GOACC_update", BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR, diff --git a/gcc/omp-expand.c b/gcc/omp-expand.c index bf1f127d8d6..f674c74ec82 100644 --- a/gcc/omp-expand.c +++ b/gcc/omp-expand.c @@ -7097,19 +7097,21 @@ expand_omp_target (struct omp_region *region) gomp_target *entry_stmt; gimple *stmt; edge e; - bool offloaded, data_region; + bool offloaded, data_region, oacc_parallel; entry_stmt = as_a <gomp_target *> (last_stmt (region->entry)); new_bb = region->entry; + oacc_parallel = false; offloaded = is_gimple_omp_offloaded (entry_stmt); switch (gimple_omp_target_kind (entry_stmt)) { + case GF_OMP_TARGET_KIND_OACC_PARALLEL: 
+ oacc_parallel = true; case GF_OMP_TARGET_KIND_REGION: case GF_OMP_TARGET_KIND_UPDATE: case GF_OMP_TARGET_KIND_ENTER_DATA: case GF_OMP_TARGET_KIND_EXIT_DATA: - case GF_OMP_TARGET_KIND_OACC_PARALLEL: case GF_OMP_TARGET_KIND_OACC_KERNELS: case GF_OMP_TARGET_KIND_OACC_UPDATE: case GF_OMP_TARGET_KIND_OACC_ENTER_EXIT_DATA: @@ -7171,7 +7173,7 @@ expand_omp_target (struct omp_region *region) .OMP_DATA_I may have been converted into a different local variable. In which case, we need to keep the assignment. */ tree data_arg = gimple_omp_target_data_arg (entry_stmt); - if (data_arg) + if (data_arg && !oacc_parallel) { basic_block entry_succ_bb = single_succ (entry_bb); gimple_stmt_iterator gsi; @@ -7489,6 +7491,11 @@ expand_omp_target (struct omp_region *region) /* The maximum number used by any start_ix, without varargs. */ auto_vec<tree, 11> args; args.quick_push (device); + if (start_ix == BUILT_IN_GOACC_PARALLEL) + { + tree use_params = oacc_parallel ? integer_one_node : integer_zero_node; + args.quick_push (use_params); + } if (offloaded) args.quick_push (build_fold_addr_expr (child_fn)); args.quick_push (t1); diff --git a/gcc/omp-low.c b/gcc/omp-low.c index e790f0f1bb2..a2869e49ebd 100644 --- a/gcc/omp-low.c +++ b/gcc/omp-low.c @@ -89,6 +89,7 @@ struct omp_context /* Map variables to fields in a structure that allows communication between sending and receiving threads. 
*/ splay_tree field_map; + splay_tree parm_map; tree record_type; tree sender_decl; tree receiver_decl; @@ -321,6 +322,14 @@ maybe_lookup_decl (const_tree var, omp_context *ctx) } static inline tree +lookup_parm (const_tree var, omp_context *ctx) +{ + splay_tree_node n; + n = splay_tree_lookup (ctx->parm_map, (splay_tree_key) var); + return (tree) n->value; +} + +static inline tree lookup_field (tree var, omp_context *ctx) { splay_tree_node n; @@ -501,15 +510,21 @@ build_receiver_ref (tree var, bool by_ref, omp_context *ctx) { tree x, field = lookup_field (var, ctx); - /* If the receiver record type was remapped in the child function, - remap the field into the new record type. */ - x = maybe_lookup_field (field, ctx); - if (x != NULL) - field = x; + if (is_oacc_parallel (ctx)) + x = lookup_parm (var, ctx); + else + { + /* If the receiver record type was remapped in the child function, + remap the field into the new record type. */ + x = maybe_lookup_field (field, ctx); + if (x != NULL) + field = x; + + x = build_simple_mem_ref (ctx->receiver_decl); + TREE_THIS_NOTRAP (x) = 1; + x = omp_build_component_ref (x, field); + } - x = build_simple_mem_ref (ctx->receiver_decl); - TREE_THIS_NOTRAP (x) = 1; - x = omp_build_component_ref (x, field); if (by_ref) { x = build_simple_mem_ref (x); @@ -644,6 +659,32 @@ build_sender_ref (tree var, omp_context *ctx) return build_sender_ref ((splay_tree_key) var, ctx); } +static void +install_parm_decl (tree var, tree type, omp_context *ctx) +{ + if (!is_oacc_parallel (ctx)) + return; + + splay_tree_key key = (splay_tree_key) var; + tree decl_name = NULL_TREE, t; + location_t loc = UNKNOWN_LOCATION; + + if (DECL_P (var)) + { + decl_name = get_identifier (get_name (var)); + loc = DECL_SOURCE_LOCATION (var); + } + t = build_decl (loc, PARM_DECL, decl_name, type); + DECL_ARTIFICIAL (t) = 1; + DECL_NAMELESS (t) = 1; + DECL_ARG_TYPE (t) = type; + DECL_CONTEXT (t) = current_function_decl; + TREE_USED (t) = 1; + TREE_READONLY (t) = 1; + + 
splay_tree_insert (ctx->parm_map, key, (splay_tree_value) t); +} + /* Add a new field for VAR inside the structure CTX->SENDER_DECL. If BASE_POINTERS_RESTRICT, declare the field with restrict. */ @@ -764,7 +805,10 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx, } if (mask & 1) - splay_tree_insert (ctx->field_map, key, (splay_tree_value) field); + { + splay_tree_insert (ctx->field_map, key, (splay_tree_value) field); + install_parm_decl (var, type, ctx); + } if ((mask & 2) && ctx->sfield_map) splay_tree_insert (ctx->sfield_map, key, (splay_tree_value) sfield); } @@ -1068,6 +1112,8 @@ delete_omp_context (splay_tree_value value) splay_tree_delete (ctx->field_map); if (ctx->sfield_map) splay_tree_delete (ctx->sfield_map); + if (ctx->parm_map) + splay_tree_delete (ctx->parm_map); /* We hijacked DECL_ABSTRACT_ORIGIN earlier. We need to clear it before it produces corrupt debug information. */ @@ -1506,6 +1552,7 @@ scan_sharing_clauses (tree clauses, omp_context *ctx, insert_field_into_struct (ctx->record_type, field); splay_tree_insert (ctx->field_map, (splay_tree_key) decl, (splay_tree_value) field); + install_parm_decl (decl, ptr_type_node, ctx); } } break; @@ -1800,10 +1847,13 @@ omp_maybe_offloaded_ctx (omp_context *ctx) } /* Build a decl for the omp child function. It'll not contain a body - yet, just the bare decl. */ + yet, just the bare decl. Unlike omp child functions, acc child + functions for parallel regions have one argument per data + mapping. 
*/ static void -create_omp_child_function (omp_context *ctx, bool task_copy) +create_omp_child_function (omp_context *ctx, bool task_copy, + unsigned int map_cnt = 0) { tree decl, type, name, t; @@ -1825,6 +1875,13 @@ create_omp_child_function (omp_context *ctx, bool task_copy) type = build_function_type_list (void_type_node, ptr_type_node, cilk_var_type, cilk_var_type, NULL_TREE); } + else if (is_oacc_parallel (ctx)) + { + tree *arg_types = (tree *) alloca (sizeof (tree) * map_cnt); + for (unsigned int i = 0; i < map_cnt; i++) + arg_types[i] = ptr_type_node; + type = build_function_type_array (void_type_node, map_cnt, arg_types); + } else type = build_function_type_list (void_type_node, ptr_type_node, NULL_TREE); @@ -1899,35 +1956,37 @@ create_omp_child_function (omp_context *ctx, bool task_copy) DECL_ARGUMENTS (decl) = t; } - tree data_name = get_identifier (".omp_data_i"); - t = build_decl (DECL_SOURCE_LOCATION (decl), PARM_DECL, data_name, - ptr_type_node); - DECL_ARTIFICIAL (t) = 1; - DECL_NAMELESS (t) = 1; - DECL_ARG_TYPE (t) = ptr_type_node; - DECL_CONTEXT (t) = current_function_decl; - TREE_USED (t) = 1; - TREE_READONLY (t) = 1; - if (cilk_for_count) - DECL_CHAIN (t) = DECL_ARGUMENTS (decl); - DECL_ARGUMENTS (decl) = t; - if (!task_copy) - ctx->receiver_decl = t; - else + if (!is_oacc_parallel (ctx)) { - t = build_decl (DECL_SOURCE_LOCATION (decl), - PARM_DECL, get_identifier (".omp_data_o"), + tree data_name = get_identifier (".omp_data_i"); + t = build_decl (DECL_SOURCE_LOCATION (decl), PARM_DECL, data_name, ptr_type_node); DECL_ARTIFICIAL (t) = 1; DECL_NAMELESS (t) = 1; DECL_ARG_TYPE (t) = ptr_type_node; DECL_CONTEXT (t) = current_function_decl; TREE_USED (t) = 1; - TREE_ADDRESSABLE (t) = 1; - DECL_CHAIN (t) = DECL_ARGUMENTS (decl); + TREE_READONLY (t) = 1; + if (cilk_for_count) + DECL_CHAIN (t) = DECL_ARGUMENTS (decl); DECL_ARGUMENTS (decl) = t; + if (!task_copy) + ctx->receiver_decl = t; + else + { + t = build_decl (DECL_SOURCE_LOCATION (decl), + 
PARM_DECL, get_identifier (".omp_data_o"), + ptr_type_node); + DECL_ARTIFICIAL (t) = 1; + DECL_NAMELESS (t) = 1; + DECL_ARG_TYPE (t) = ptr_type_node; + DECL_CONTEXT (t) = current_function_decl; + TREE_USED (t) = 1; + TREE_ADDRESSABLE (t) = 1; + DECL_CHAIN (t) = DECL_ARGUMENTS (decl); + DECL_ARGUMENTS (decl) = t; + } } - /* Allocate memory for the function structure. The call to allocate_struct_function clobbers CFUN, so we need to restore it afterward. */ @@ -2608,6 +2667,7 @@ scan_omp_target (gomp_target *stmt, omp_context *outer_ctx) ctx = new_omp_context (stmt, outer_ctx); ctx->field_map = splay_tree_new (splay_tree_compare_pointers, 0, 0); + ctx->parm_map = splay_tree_new (splay_tree_compare_pointers, 0, 0); ctx->default_kind = OMP_CLAUSE_DEFAULT_SHARED; ctx->record_type = lang_hooks.types.make_type (RECORD_TYPE); name = create_tmp_var_name (".omp_data_t"); @@ -2621,8 +2681,11 @@ scan_omp_target (gomp_target *stmt, omp_context *outer_ctx) bool base_pointers_restrict = false; if (offloaded) { - create_omp_child_function (ctx, false); - gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn); + if (!is_oacc_parallel (ctx)) + { + create_omp_child_function (ctx, false); + gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn); + } base_pointers_restrict = omp_target_base_pointers_restrict_p (clauses); if (base_pointers_restrict @@ -7921,6 +7984,18 @@ convert_from_firstprivate_int (tree var, tree orig_type, bool is_ref, return var; } +static tree +append_decl_arg (tree var, tree decl_args, omp_context *ctx) +{ + if (!is_oacc_parallel (ctx)) + return NULL_TREE; + + tree temp = lookup_parm (var, ctx); + DECL_CHAIN (temp) = decl_args; + + return temp; +} + /* Lower the GIMPLE_OMP_TARGET in the current statement in GSI_P. CTX holds context information for the directive. 
*/ @@ -7934,7 +8009,7 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) gimple_seq tgt_body, olist, ilist, fplist, new_body; location_t loc = gimple_location (stmt); bool offloaded, data_region; - unsigned int map_cnt = 0; + unsigned int map_cnt = 0, init_cnt = 0; offloaded = is_gimple_omp_offloaded (stmt); switch (gimple_omp_target_kind (stmt)) @@ -7980,11 +8055,83 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) } else if (data_region) tgt_body = gimple_omp_body (stmt); - child_fn = ctx->cb.dst_fn; push_gimplify_context (); fplist = NULL; + /* Determine init_cnt to finish initialize ctx. */ + + if (is_oacc_parallel (ctx)) + { + for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c)) + switch (OMP_CLAUSE_CODE (c)) + { + tree var; + + default: + break; + case OMP_CLAUSE_MAP: + case OMP_CLAUSE_TO: + case OMP_CLAUSE_FROM: + init_oacc_firstprivate: + var = OMP_CLAUSE_DECL (c); + if (!DECL_P (var)) + { + if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_MAP + || (!OMP_CLAUSE_MAP_ZERO_BIAS_ARRAY_SECTION (c) + && (OMP_CLAUSE_MAP_KIND (c) + != GOMP_MAP_FIRSTPRIVATE_POINTER))) + init_cnt++; + continue; + } + + if (DECL_SIZE (var) + && TREE_CODE (DECL_SIZE (var)) != INTEGER_CST) + { + tree var2 = DECL_VALUE_EXPR (var); + gcc_assert (TREE_CODE (var2) == INDIRECT_REF); + var2 = TREE_OPERAND (var2, 0); + gcc_assert (DECL_P (var2)); + var = var2; + } + + if (offloaded + && OMP_CLAUSE_CODE (c) == OMP_CLAUSE_MAP + && (OMP_CLAUSE_MAP_KIND (c) == GOMP_MAP_FIRSTPRIVATE_POINTER + || (OMP_CLAUSE_MAP_KIND (c) + == GOMP_MAP_FIRSTPRIVATE_REFERENCE))) + { + continue; + } + + if (!maybe_lookup_field (var, ctx)) + continue; + + init_cnt++; + break; + + case OMP_CLAUSE_FIRSTPRIVATE: + if (is_oacc_parallel (ctx)) + goto init_oacc_firstprivate; + init_cnt++; + break; + + case OMP_CLAUSE_USE_DEVICE_PTR: + case OMP_CLAUSE_IS_DEVICE_PTR: + init_cnt++; + break; + } + + /* Initialize the offloaded child function. 
*/ + + create_omp_child_function (ctx, false, init_cnt); + gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn); + } + + child_fn = ctx->cb.dst_fn; + + /* Clause Pass 1: Scan and prepare sender decls VALUE_EXPRs for + usage on the child function. */ for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c)) switch (OMP_CLAUSE_CODE (c)) { @@ -8247,6 +8394,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) if (offloaded) { + if (is_oacc_parallel (ctx)) + gcc_assert (init_cnt == map_cnt); target_nesting_level++; lower_omp (&tgt_body, ctx); target_nesting_level--; @@ -8293,6 +8442,7 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) vec_alloc (vsize, map_cnt); vec_alloc (vkind, map_cnt); unsigned int map_idx = 0; + tree decl_args = NULL_TREE; for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c)) switch (OMP_CLAUSE_CODE (c)) @@ -8488,6 +8638,7 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) if (s == NULL_TREE) s = integer_one_node; s = fold_convert (size_type_node, s); + decl_args = append_decl_arg (ovar, decl_args, ctx); purpose = size_int (map_idx++); CONSTRUCTOR_APPEND_ELT (vsize, purpose, s); if (TREE_CODE (s) != INTEGER_CST) @@ -8628,6 +8779,7 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) else s = TYPE_SIZE_UNIT (TREE_TYPE (ovar)); s = fold_convert (size_type_node, s); + decl_args = append_decl_arg (ovar, decl_args, ctx); purpose = size_int (map_idx++); CONSTRUCTOR_APPEND_ELT (vsize, purpose, s); if (TREE_CODE (s) != INTEGER_CST) @@ -8667,6 +8819,7 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) } gimplify_assign (x, var, &ilist); s = size_int (0); + decl_args = append_decl_arg (ovar, decl_args, ctx); purpose = size_int (map_idx++); CONSTRUCTOR_APPEND_ELT (vsize, purpose, s); gcc_checking_assert (tkind @@ -8679,6 +8832,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) } gcc_assert (map_idx == map_cnt); + if (is_oacc_parallel (ctx)) + DECL_ARGUMENTS (child_fn) = 
nreverse (decl_args); DECL_INITIAL (TREE_VEC_ELT (t, 1)) = build_constructor (TREE_TYPE (TREE_VEC_ELT (t, 1)), vsize); @@ -8717,9 +8872,12 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) { t = build_fold_addr_expr_loc (loc, ctx->sender_decl); /* fixup_child_record_type might have changed receiver_decl's type. */ - t = fold_convert_loc (loc, TREE_TYPE (ctx->receiver_decl), t); - gimple_seq_add_stmt (&new_body, - gimple_build_assign (ctx->receiver_decl, t)); + if (!is_oacc_parallel (ctx)) + { + t = fold_convert_loc (loc, TREE_TYPE (ctx->receiver_decl), t); + gimple_seq_add_stmt (&new_body, + gimple_build_assign (ctx->receiver_decl, t)); + } } gimple_seq_add_seq (&new_body, fplist); diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c index aab6821e792..c23ddeb9c86 100644 --- a/gcc/tree-ssa-structalias.c +++ b/gcc/tree-ssa-structalias.c @@ -4618,6 +4618,7 @@ find_func_aliases_for_builtin_call (struct function *fn, gcall *t) case BUILT_IN_GOMP_PARALLEL: case BUILT_IN_GOACC_PARALLEL: { + bool oacc_parallel = false; if (in_ipa_mode) { unsigned int fnpos, argpos; @@ -4631,13 +4632,17 @@ find_func_aliases_for_builtin_call (struct function *fn, gcall *t) case BUILT_IN_GOACC_PARALLEL: /* __builtin_GOACC_parallel (device, fn, mapnum, hostaddrs, sizes, kinds, ...). 
*/ - fnpos = 1; - argpos = 3; + fnpos = 2; + argpos = 4; + oacc_parallel = gimple_call_arg (t, 1) == integer_one_node; break; default: gcc_unreachable (); } + if (oacc_parallel) + break; + tree fnarg = gimple_call_arg (t, fnpos); gcc_assert (TREE_CODE (fnarg) == ADDR_EXPR); tree fndecl = TREE_OPERAND (fnarg, 0); @@ -5195,6 +5200,7 @@ find_func_clobbers (struct function *fn, gimple *origt) unsigned int fnpos, argpos; unsigned int implicit_use_args[2]; unsigned int num_implicit_use_args = 0; + bool oacc_parallel = false; switch (DECL_FUNCTION_CODE (decl)) { case BUILT_IN_GOMP_PARALLEL: @@ -5205,15 +5211,19 @@ find_func_clobbers (struct function *fn, gimple *origt) case BUILT_IN_GOACC_PARALLEL: /* __builtin_GOACC_parallel (device, fn, mapnum, hostaddrs, sizes, kinds, ...). */ - fnpos = 1; - argpos = 3; - implicit_use_args[num_implicit_use_args++] = 4; + fnpos = 2; + argpos = 4; implicit_use_args[num_implicit_use_args++] = 5; + implicit_use_args[num_implicit_use_args++] = 6; + oacc_parallel = gimple_call_arg (t, 1) == integer_one_node; break; default: gcc_unreachable (); } + if (oacc_parallel) + break; + tree fnarg = gimple_call_arg (t, fnpos); gcc_assert (TREE_CODE (fnarg) == ADDR_EXPR); tree fndecl = TREE_OPERAND (fnarg, 0); @@ -7968,7 +7978,7 @@ ipa_pta_execute (void) if (gimple_call_builtin_p (stmt, BUILT_IN_GOMP_PARALLEL)) called_decl = TREE_OPERAND (gimple_call_arg (stmt, 0), 0); else if (gimple_call_builtin_p (stmt, BUILT_IN_GOACC_PARALLEL)) - called_decl = TREE_OPERAND (gimple_call_arg (stmt, 1), 0); + called_decl = TREE_OPERAND (gimple_call_arg (stmt, 2), 0); if (called_decl != NULL_TREE && !fndecl_maybe_in_other_partition (called_decl)) diff --git a/libgomp/Makefile.am b/libgomp/Makefile.am index 99ad2fd456d..4de30914d3d 100644 --- a/libgomp/Makefile.am +++ b/libgomp/Makefile.am @@ -13,9 +13,16 @@ search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir) \ fincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)$(MULTISUBDIR)/finclude 
 libsubincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/include
+LIBFFI = @LIBFFI@
+LIBFFIINCS = @LIBFFIINCS@
+
+if USE_LIBFFI
+libgomp_la_LIBADD = $(LIBFFI)
+endif
+

 vpath % $(strip $(search_path))

-AM_CPPFLAGS = $(addprefix -I, $(search_path))
+AM_CPPFLAGS = $(addprefix -I, $(search_path)) $(LIBFFIINCS)
 AM_CFLAGS = $(XCFLAGS)
 AM_LDFLAGS = $(XLDFLAGS) $(SECTION_LDFLAGS) $(OPT_LDFLAGS)
diff --git a/libgomp/Makefile.in b/libgomp/Makefile.in
index 7a84b5681e1..617615d4d52 100644
--- a/libgomp/Makefile.in
+++ b/libgomp/Makefile.in
@@ -171,7 +171,6 @@ libgomp_plugin_nvptx_la_LINK = $(LIBTOOL) --tag=CC \
         $(libgomp_plugin_nvptx_la_LDFLAGS) $(LDFLAGS) -o $@
 @PLUGIN_NVPTX_TRUE@am_libgomp_plugin_nvptx_la_rpath = -rpath \
 @PLUGIN_NVPTX_TRUE@     $(toolexeclibdir)
-libgomp_la_LIBADD =
 @USE_FORTRAN_TRUE@am__objects_1 = openacc.lo
 am_libgomp_la_OBJECTS = alloc.lo atomic.lo barrier.lo critical.lo \
         env.lo error.lo icv.lo icv-device.lo iter.lo iter_ull.lo \
@@ -279,6 +278,8 @@ INSTALL_SCRIPT = @INSTALL_SCRIPT@
 INSTALL_STRIP_PROGRAM = @INSTALL_STRIP_PROGRAM@
 LD = @LD@
 LDFLAGS = @LDFLAGS@
+LIBFFI = @LIBFFI@
+LIBFFIINCS = @LIBFFIINCS@
 LIBOBJS = @LIBOBJS@
 LIBS = @LIBS@
 LIBTOOL = @LIBTOOL@
@@ -410,7 +411,8 @@ search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir) \

 fincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)$(MULTISUBDIR)/finclude
 libsubincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/include
-AM_CPPFLAGS = $(addprefix -I, $(search_path))
+libgomp_la_LIBADD = $(LIBFFI)
+AM_CPPFLAGS = $(addprefix -I, $(search_path)) $(LIBFFIINCS)
 AM_CFLAGS = $(XCFLAGS)
 AM_LDFLAGS = $(XLDFLAGS) $(SECTION_LDFLAGS) $(OPT_LDFLAGS)
 toolexeclib_LTLIBRARIES = libgomp.la $(am__append_1) $(am__append_2)
diff --git a/libgomp/config.h.in b/libgomp/config.h.in
index 2f45aa74bbe..65e01c5376a 100644
--- a/libgomp/config.h.in
+++ b/libgomp/config.h.in
@@ -189,5 +189,8 @@
 /* Define to 1 if the target use emutls for thread-local storage. */
 #undef USE_EMUTLS

+/* Define to 1 if the target requires libffi to call the offloaded functions. */
+#undef USE_LIBFFI
+
 /* Version number of package */
 #undef VERSION
diff --git a/libgomp/configure b/libgomp/configure
index 11f5b0b1e1c..cc24a81372e 100755
--- a/libgomp/configure
+++ b/libgomp/configure
@@ -649,6 +649,10 @@ PLUGIN_NVPTX
 CUDA_DRIVER_LIB
 CUDA_DRIVER_INCLUDE
 offload_targets
+USE_LIBFFI_FALSE
+USE_LIBFFI_TRUE
+LIBFFIINCS
+LIBFFI
 libtool_VERSION
 ac_ct_FC
 FCFLAGS
@@ -2655,7 +2659,6 @@ else
 fi


-
 # -------
 # -------

@@ -11155,7 +11158,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 11158 "configure"
+#line 11161 "configure"
 #include "confdefs.h"

 #if HAVE_DLFCN_H
@@ -11261,7 +11264,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 11264 "configure"
+#line 11267 "configure"
 #include "confdefs.h"

 #if HAVE_DLFCN_H
@@ -15137,6 +15140,28 @@ $as_echo "#define LIBGOMP_OFFLOADED_ONLY 1" >>confdefs.h

 fi

+# Prepare libffi when necessary.
+
+LIBFFI=
+LIBFFIINCS=
+if test -d ../libffi; then
+
+$as_echo "#define USE_LIBFFI 1" >>confdefs.h
+
+  LIBFFI=../libffi/libffi_convenience.la
+  LIBFFIINCS='-I$(top_srcdir)/../libffi/include -I../libffi/include'
+fi
+
+
+ if test -d ../libffi; then
+  USE_LIBFFI_TRUE=
+  USE_LIBFFI_FALSE='#'
+else
+  USE_LIBFFI_TRUE='#'
+  USE_LIBFFI_FALSE=
+fi
+
+
 # Plugins for offload execution, configure.ac fragment. -*- mode: autoconf -*-
 #
 # Copyright (C) 2014-2017 Free Software Foundation, Inc.
@@ -16960,6 +16985,10 @@ if test -z "${MAINTAINER_MODE_TRUE}" && test -z "${MAINTAINER_MODE_FALSE}"; then
   as_fn_error "conditional \"MAINTAINER_MODE\" was never defined.
 Usually this means the macro was only invoked conditionally." "$LINENO" 5
 fi
+if test -z "${USE_LIBFFI_TRUE}" && test -z "${USE_LIBFFI_FALSE}"; then
+  as_fn_error "conditional \"USE_LIBFFI\" was never defined.
+Usually this means the macro was only invoked conditionally." "$LINENO" 5
+fi
 if test -z "${PLUGIN_NVPTX_TRUE}" && test -z "${PLUGIN_NVPTX_FALSE}"; then
   as_fn_error "conditional \"PLUGIN_NVPTX\" was never defined.
 Usually this means the macro was only invoked conditionally." "$LINENO" 5
diff --git a/libgomp/configure.ac b/libgomp/configure.ac
index a42d4f08b4b..aa49577537e 100644
--- a/libgomp/configure.ac
+++ b/libgomp/configure.ac
@@ -28,7 +28,6 @@ LIBGOMP_ENABLE(generated-files-in-srcdir, no, ,
 AC_MSG_RESULT($enable_generated_files_in_srcdir)
 AM_CONDITIONAL(GENINSRC, test "$enable_generated_files_in_srcdir" = yes)

-
 # -------
 # -------

@@ -215,6 +214,19 @@ if test x$libgomp_offloaded_only = xyes; then
             [Define to 1 if building libgomp for an accelerator-only target.])
 fi

+# Prepare libffi when necessary.
+
+LIBFFI=
+LIBFFIINCS=
+if test -d ../libffi; then
+  AC_DEFINE(USE_LIBFFI, 1, [Define if we're to use libffi.])
+  LIBFFI=../libffi/libffi_convenience.la
+  LIBFFIINCS='-I$(top_srcdir)/../libffi/include -I../libffi/include'
+fi
+AC_SUBST(LIBFFI)
+AC_SUBST(LIBFFIINCS)
+AM_CONDITIONAL([USE_LIBFFI], [test -d ../libffi])
+
 m4_include([plugin/configfrag.ac])

 # Check for functions needed.
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index c025069b457..44097cfd56a 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -119,6 +119,13 @@ extern void GOMP_OFFLOAD_openacc_exec (void (*) (void *), size_t, void **,
 extern void GOMP_OFFLOAD_openacc_async_exec (void (*) (void *), size_t, void **,
                                              void **, unsigned *, void *,
                                              struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_exec_params (void (*) (void *), size_t,
+                                              void **, void **, unsigned *,
+                                              void *);
+extern void GOMP_OFFLOAD_openacc_async_exec_params (void (*) (void *), size_t,
+                                                    void **, void **,
+                                                    unsigned *, void *,
+                                                    struct goacc_asyncqueue *);
 extern struct goacc_asyncqueue *GOMP_OFFLOAD_openacc_async_construct (void);
 extern bool GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *);
 extern int GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 59e7ca8b8c8..a31c83cc656 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -885,6 +885,7 @@ typedef struct acc_dispatch_t

   /* Execute. */
   __typeof (GOMP_OFFLOAD_openacc_exec) *exec_func;
+  __typeof (GOMP_OFFLOAD_openacc_exec_params) *exec_params_func;

   struct {
     gomp_mutex_t lock;
@@ -900,6 +901,7 @@ typedef struct acc_dispatch_t
     __typeof (GOMP_OFFLOAD_openacc_async_queue_callback) *queue_callback_func;

     __typeof (GOMP_OFFLOAD_openacc_async_exec) *exec_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_exec_params) *exec_params_func;
     __typeof (GOMP_OFFLOAD_openacc_async_host2dev) *host2dev_func;
     __typeof (GOMP_OFFLOAD_openacc_async_dev2host) *dev2host_func;
   } async;
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 546ac929a0e..7a49acc1dfe 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -461,8 +461,10 @@ GOACC_2.0.1 {
 GOACC_2.0.GOMP_4_BRANCH {
   global:
         GOMP_set_offload_targets;
+        GOACC_parallel_keyed_v2;
 } GOACC_2.0.1;

+
 GOMP_PLUGIN_1.0 {
   global:
         GOMP_PLUGIN_malloc;
diff --git a/libgomp/libgomp_g.h b/libgomp/libgomp_g.h
index 958ca6e9cc3..c40e67f2e80 100644
--- a/libgomp/libgomp_g.h
+++ b/libgomp/libgomp_g.h
@@ -298,6 +298,8 @@ extern void GOMP_teams (unsigned int, unsigned int);

 extern void GOACC_parallel_keyed (int, void (*) (void *), size_t,
                                   void **, size_t *, unsigned short *, ...);
+extern void GOACC_parallel_keyed_v2 (int, int, void (*) (void *), size_t,
+                                     void **, size_t *, unsigned short *, ...);
 extern void GOACC_parallel (int, void (*) (void *), size_t, void **, size_t *,
                             unsigned short *, int, int, int, int, int, ...);
 extern void GOACC_data_start (int, size_t, void **, size_t *,
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 3b2cafb2c55..5b4e34d7190 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -158,6 +158,30 @@ host_openacc_async_exec (void (*fn) (void *),
   fn (hostaddrs);
 }

+static void
+host_openacc_exec_params (void (*fn) (void *),
+                          size_t mapnum __attribute__ ((unused)),
+                          void **hostaddrs,
+                          void **devaddrs __attribute__ ((unused)),
+                          unsigned *dims __attribute__ ((unused)),
+                          void *targ_mem_desc __attribute__ ((unused)))
+{
+  fn (hostaddrs);
+}
+
+static void
+host_openacc_async_exec_params (void (*fn) (void *),
+                                size_t mapnum __attribute__ ((unused)),
+                                void **hostaddrs,
+                                void **devaddrs __attribute__ ((unused)),
+                                unsigned *dims __attribute__ ((unused)),
+                                void *targ_mem_desc __attribute__ ((unused)),
+                                struct goacc_asyncqueue *aq __attribute__ ((unused)))
+{
+  fn (hostaddrs);
+}
+
+
 static int
 host_openacc_async_test (struct goacc_asyncqueue *aq __attribute__ ((unused)))
 {
@@ -265,6 +289,7 @@ static struct gomp_device_descr host_dispatch =
       .data_environ = NULL,

       .exec_func = host_openacc_exec,
+      .exec_params_func = host_openacc_exec_params,

       .async = {
         .construct_func = host_openacc_async_construct,
@@ -274,6 +299,7 @@ static struct gomp_device_descr host_dispatch =
         .serialize_func = host_openacc_async_serialize,
         .queue_callback_func = host_openacc_async_queue_callback,
         .exec_func = host_openacc_async_exec,
+        .exec_params_func = host_openacc_async_exec_params,
         .dev2host_func = host_openacc_async_dev2host,
         .host2dev_func = host_openacc_async_host2dev,
       },
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 1172d739ec7..3c5aa24b5f5 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -31,6 +31,9 @@
 #include "libgomp_g.h"
 #include "gomp-constants.h"
 #include "oacc-int.h"
+#if USE_LIBFFI
+# include "ffi.h"
+#endif
 #ifdef HAVE_INTTYPES_H
 # include <inttypes.h> /* For PRIu64. */
 #endif
@@ -104,19 +107,47 @@ handle_ftn_pointers (size_t mapnum, void **hostaddrs, size_t *sizes,

 static void goacc_wait (int async, int num_waits, va_list *ap);

+static void
+goacc_call_host_fn (void (*fn) (void *), size_t mapnum, void **hostaddrs,
+                    int params)
+{
+#ifdef USE_LIBFFI
+  ffi_cif cif;
+  ffi_type *arg_types[mapnum];
+  void *arg_values[mapnum];
+  ffi_arg result;
+  int i;
+
+  if (params)
+    {
+      for (i = 0; i < mapnum; i++)
+        {
+          arg_types[i] = &ffi_type_pointer;
+          arg_values[i] = &hostaddrs[i];
+        }
+
+      if (ffi_prep_cif (&cif, FFI_DEFAULT_ABI, mapnum,
+                        &ffi_type_void, arg_types) == FFI_OK)
+        ffi_call (&cif, FFI_FN (fn), &result, arg_values);
+      else
+        abort ();
+    }
+  else
+#endif
+  fn (hostaddrs);
+}

 /* Launch a possibly offloaded function on DEVICE. FN is the host fn
    address. MAPNUM, HOSTADDRS, SIZES & KINDS describe the memory
    blocks to be copied to/from the device. Varadic arguments are
    keyed optional parameters terminated with a zero. */

-void
-GOACC_parallel_keyed (int device, void (*fn) (void *),
-                      size_t mapnum, void **hostaddrs, size_t *sizes,
-                      unsigned short *kinds, ...)
+static void
+GOACC_parallel_keyed_internal (int device, int params, void (*fn) (void *),
+                               size_t mapnum, void **hostaddrs, size_t *sizes,
+                               unsigned short *kinds, va_list *ap)
 {
   bool host_fallback = device == GOMP_DEVICE_HOST_FALLBACK;
-  va_list ap;
   struct goacc_thread *thr;
   struct gomp_device_descr *acc_dev;
   struct target_mem_desc *tgt;
@@ -206,13 +237,13 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
       prof_info.device_type = acc_device_host;
       api_info.device_type = prof_info.device_type;
       goacc_save_and_set_bind (acc_device_host);
-      fn (hostaddrs);
+      goacc_call_host_fn (fn, mapnum, hostaddrs, params);
       goacc_restore_bind ();
       goto out;
     }
   else if (acc_device_type (acc_dev->type) == acc_device_host)
     {
-      fn (hostaddrs);
+      goacc_call_host_fn (fn, mapnum, hostaddrs, params);
       goto out;
     }
   else if (profiling_dispatch_p)
@@ -222,9 +253,8 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
   for (i = 0; i != GOMP_DIM_MAX; i++)
     dims[i] = 0;

-  va_start (ap, kinds);
   /* TODO: This will need amending when device_type is implemented. */
-  while ((tag = va_arg (ap, unsigned)) != 0)
+  while ((tag = va_arg (*ap, unsigned)) != 0)
     {
       if (GOMP_LAUNCH_DEVICE (tag))
         gomp_fatal ("device_type '%d' offload parameters, libgomp is too old",
@@ -238,7 +268,7 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),

             for (i = 0; i != GOMP_DIM_MAX; i++)
               if (mask & GOMP_DIM_MASK (i))
-                dims[i] = va_arg (ap, unsigned);
+                dims[i] = va_arg (*ap, unsigned);
           }
           break;

@@ -248,7 +278,7 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
             async = GOMP_LAUNCH_OP (tag);

             if (async == GOMP_LAUNCH_OP_MAX)
-              async = va_arg (ap, unsigned);
+              async = va_arg (*ap, unsigned);

             if (profiling_dispatch_p)
               {
@@ -267,7 +297,7 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
             int num_waits = ((signed short) GOMP_LAUNCH_OP (tag));

             if (num_waits > 0)
-              goacc_wait (async, num_waits, &ap);
+              goacc_wait (async, num_waits, ap);
             else if (num_waits == acc_async_noval)
               acc_wait_all_async (async);
             break;
@@ -278,7 +308,6 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
                       " libgomp is too old", GOMP_LAUNCH_CODE (tag));
         }
     }
-  va_end (ap);

   if (!(acc_dev->capabilities & GOMP_OFFLOAD_CAP_NATIVE_EXEC))
     {
@@ -338,8 +367,12 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),

   if (aq == NULL)
     {
-      acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
-                                  dims, tgt);
+      if (params)
+        acc_dev->openacc.exec_params_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+                                           dims, tgt);
+      else
+        acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+                                    dims, tgt);
       if (profiling_dispatch_p)
         {
           prof_info.event_type = acc_ev_exit_data_start;
@@ -362,8 +395,12 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
     }
   else
     {
-      acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
-                                        dims, tgt, aq);
+      if (params)
+        acc_dev->openacc.async.exec_params_func (tgt_fn, mapnum, hostaddrs,
+                                                 devaddrs, dims, tgt, aq);
+      else
+        acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs,
+                                          devaddrs, dims, tgt, aq);
       goacc_async_copyout_unmap_vars (tgt, aq);
     }

@@ -381,6 +418,30 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
     }
 }

+void
+GOACC_parallel_keyed (int device, void (*fn) (void *),
+                      size_t mapnum, void **hostaddrs, size_t *sizes,
+                      unsigned short *kinds, ...)
+{
+  va_list ap;
+  va_start (ap, kinds);
+  GOACC_parallel_keyed_internal (device, 0, fn, mapnum, hostaddrs, sizes,
+                                 kinds, &ap);
+  va_end (ap);
+}
+
+void
+GOACC_parallel_keyed_v2 (int device, int args, void (*fn) (void *),
+                         size_t mapnum, void **hostaddrs, size_t *sizes,
+                         unsigned short *kinds, ...)
+{
+  va_list ap;
+  va_start (ap, kinds);
+  GOACC_parallel_keyed_internal (device, args, fn, mapnum, hostaddrs, sizes,
+                                 kinds, &ap);
+  va_end (ap);
+}
+
 /* Legacy entry point, only provide host execution. */

 void
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 94abfe2036f..bdc0c30e1f5 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -697,12 +697,11 @@ link_ptx (CUmodule *module, const struct targ_ptx_obj *ptx_objs,
 static void
 nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
             unsigned *dims, void *targ_mem_desc,
-            CUdeviceptr dp, CUstream stream)
+            void **kargs, CUstream stream)
 {
   struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
   CUfunction function;
   int i;
-  void *kargs[1];
   int cpu_size = nvptx_thread ()->ptx_dev->max_threads_per_multiprocessor;
   int block_size = nvptx_thread ()->ptx_dev->max_threads_per_block;
   int dev_size = nvptx_thread ()->ptx_dev->num_sms;
@@ -888,7 +887,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
                                             api_info);
     }

-  kargs[0] = &dp;
   CUDA_CALL_ASSERT (cuLaunchKernel, function,
                     dims[GOMP_DIM_GANG], 1, 1,
                     dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
@@ -1293,22 +1291,29 @@ GOMP_OFFLOAD_free (int ord, void *ptr)
           && nvptx_free (ptr, ptx_devices[ord]));
 }

-void
-GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
-                           void **hostaddrs, void **devaddrs,
-                           unsigned *dims, void *targ_mem_desc)
+static void
+openacc_exec_internal (void (*fn) (void *), int params, size_t mapnum,
+                       void **hostaddrs, void **devaddrs,
+                       unsigned *dims, void *targ_mem_desc)
 {
   GOMP_PLUGIN_debug (0, " %s: prepare mappings\n", __FUNCTION__);

-  void **hp = NULL;
+  void **hp = alloca (mapnum * sizeof (void *));
   CUdeviceptr dp = 0;

   if (mapnum > 0)
     {
-      hp = alloca (mapnum * sizeof (void *));
-      for (int i = 0; i < mapnum; i++)
-        hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-      CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+      if (params)
+        {
+          for (int i = 0; i < mapnum; i++)
+            hp[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
+        }
+      else
+        {
+          for (int i = 0; i < mapnum; i++)
+            hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
+          CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+        }
     }

   /* Copy the (device) pointers to arguments to the device (dp and hp might in
@@ -1333,7 +1338,8 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
       data_event_info.data_event.var_name = NULL; //TODO
       data_event_info.data_event.bytes = mapnum * sizeof (void *);
       data_event_info.data_event.host_ptr = hp;
-      data_event_info.data_event.device_ptr = (void *) dp;
+      if (!params)
+        data_event_info.data_event.device_ptr = (void *) dp;

       api_info->device_api = acc_device_api_cuda;

@@ -1341,7 +1347,7 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
                                             api_info);
     }

-  if (mapnum > 0)
+  if (!params && mapnum > 0)
     CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp,
                       mapnum * sizeof (void *));

@@ -1353,8 +1359,15 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
                                             api_info);
     }

-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-              dp, NULL);
+  if (params)
+    nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+                hp, NULL);
+  else
+    {
+      void *kargs[1] = { &dp };
+      nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+                  kargs, NULL);
+    }

   CUresult r = cuStreamSynchronize (NULL);
   const char *maybe_abort_msg = "(perhaps abort was called)";
@@ -1363,7 +1376,27 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
                        maybe_abort_msg);
   else if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
-  CUDA_CALL_ASSERT (cuMemFree, dp);
+
+  if (!params)
+    CUDA_CALL_ASSERT (cuMemFree, dp);
+}
+
+void
+GOMP_OFFLOAD_openacc_exec_params (void (*fn) (void *), size_t mapnum,
+                                  void **hostaddrs, void **devaddrs,
+                                  unsigned *dims, void *targ_mem_desc)
+{
+  openacc_exec_internal (fn, 1, mapnum, hostaddrs, devaddrs, dims,
+                         targ_mem_desc);
+}
+
+void
+GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
+                           void **hostaddrs, void **devaddrs,
+                           unsigned *dims, void *targ_mem_desc)
+{
+  openacc_exec_internal (fn, 0, mapnum, hostaddrs, devaddrs, dims,
+                         targ_mem_desc);
 }

 static void
@@ -1374,11 +1407,11 @@ cuda_free_argmem (void *ptr)
   free (block);
 }

-void
-GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
-                                 void **hostaddrs, void **devaddrs,
-                                 unsigned *dims, void *targ_mem_desc,
-                                 struct goacc_asyncqueue *aq)
+static void
+openacc_async_exec_internal (void (*fn) (void *), int params, size_t mapnum,
+                             void **hostaddrs, void **devaddrs,
+                             unsigned *dims, void *targ_mem_desc,
+                             struct goacc_asyncqueue *aq)
 {
   GOMP_PLUGIN_debug (0, " %s: prepare mappings\n", __FUNCTION__);

@@ -1388,11 +1421,20 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,

   if (mapnum > 0)
     {
-      block = (void **) GOMP_PLUGIN_malloc ((mapnum + 2) * sizeof (void *));
-      hp = block + 2;
-      for (int i = 0; i < mapnum; i++)
-        hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-      CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+      if (params)
+        {
+          hp = alloca (sizeof (void *) * mapnum);
+          for (int i = 0; i < mapnum; i++)
+            hp[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
+        }
+      else
+        {
+          block = (void **) GOMP_PLUGIN_malloc ((mapnum + 2) * sizeof (void *));
+          hp = block + 2;
+          for (int i = 0; i < mapnum; i++)
+            hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
+          CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+        }
     }

   /* Copy the (device) pointers to arguments to the device (dp and hp might in
@@ -1417,7 +1459,8 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
       data_event_info.data_event.var_name = NULL; //TODO
       data_event_info.data_event.bytes = mapnum * sizeof (void *);
       data_event_info.data_event.host_ptr = hp;
-      data_event_info.data_event.device_ptr = (void *) dp;
+      if (!params)
+        data_event_info.data_event.device_ptr = (void *) dp;

       api_info->device_api = acc_device_api_cuda;

@@ -1425,7 +1468,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
                                             api_info);
     }

-  if (mapnum > 0)
+  if (!params && mapnum > 0)
     {
       CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp,
                         mapnum * sizeof (void *), aq->cuda_stream);
@@ -1443,14 +1486,42 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
       GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
                                             api_info);
     }
-
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-              dp, aq->cuda_stream);

-  if (mapnum > 0)
+  if (params)
+    nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+                hp, aq->cuda_stream);
+  else
+    {
+      void *kargs[1] = { &dp };
+      nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+                  kargs, aq->cuda_stream);
+    }
+
+  if (!params && mapnum > 0)
     GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
 }

+void
+GOMP_OFFLOAD_openacc_async_exec_params (void (*fn) (void *), size_t mapnum,
+                                        void **hostaddrs, void **devaddrs,
+                                        unsigned *dims, void *targ_mem_desc,
+                                        struct goacc_asyncqueue *aq)
+{
+  openacc_async_exec_internal (fn, 1, mapnum, hostaddrs, devaddrs, dims,
+                               targ_mem_desc, aq);
+}
+
+void
+GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
+                                 void **hostaddrs, void **devaddrs,
+                                 unsigned *dims, void *targ_mem_desc,
+                                 struct goacc_asyncqueue *aq)
+{
+  openacc_async_exec_internal (fn, 0, mapnum, hostaddrs, devaddrs, dims,
+                               targ_mem_desc, aq);
+}
+
+
 void *
 GOMP_OFFLOAD_openacc_create_thread_data (int ord)
 {
diff --git a/libgomp/target.c b/libgomp/target.c
index 336581d2196..10c5e34f378 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2908,6 +2908,7 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
     {
       if (!DLSYM_OPT (openacc.exec, openacc_exec)
+          || !DLSYM_OPT (openacc.exec_params, openacc_exec_params)
           || !DLSYM_OPT (openacc.create_thread_data,
                          openacc_create_thread_data)
           || !DLSYM_OPT (openacc.destroy_thread_data,
@@ -2920,6 +2921,7 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
           || !DLSYM_OPT (openacc.async.queue_callback,
                          openacc_async_queue_callback)
           || !DLSYM_OPT (openacc.async.exec, openacc_async_exec)
+          || !DLSYM_OPT (openacc.async.exec_params, openacc_async_exec_params)
           || !DLSYM_OPT (openacc.async.dev2host, openacc_async_dev2host)
           || !DLSYM_OPT (openacc.async.host2dev, openacc_async_host2dev))
         {
diff --git a/libgomp/testsuite/Makefile.in b/libgomp/testsuite/Makefile.in
index 6edb7ae7ade..4d7f43abe3d 100644
--- a/libgomp/testsuite/Makefile.in
+++ b/libgomp/testsuite/Makefile.in
@@ -120,6 +120,8 @@ INSTALL_SCRIPT = @INSTALL_SCRIPT@
 INSTALL_STRIP_PROGRAM = @INSTALL_STRIP_PROGRAM@
 LD = @LD@
 LDFLAGS = @LDFLAGS@
+LIBFFI = @LIBFFI@
+LIBFFIINCS = @LIBFFIINCS@
 LIBOBJS = @LIBOBJS@
 LIBS = @LIBS@
 LIBTOOL = @LIBTOOL@
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
index dad6d13eb60..c6abc1d724a 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
@@ -1,6 +1,11 @@
 /* This test exercises combined directives. */

+/* This test falls back to host execution because struct alias
+   analysis is deactivated on OpenACC parallel regions. Consequently,
+   parloops can no longer disambiguate arrays a and b. */
+
 /* { dg-do run } */
+/* { dg-xfail-if "n/a" { openacc_nvidia_accel_selected } { "-O2" } { "" } } */

 #include <stdlib.h>