(This patch serves as a straw man proposal to have something concrete for discussion and further patches)
On PTX, stack memory is private to each thread.  When the master thread
constructs 'omp_data_o' on its own stack and passes it to other threads by
reference via GOMP_parallel, the other threads cannot use the resulting
pointer.  We need to arrange for structures passed between threads to be
placed in global memory, or better, in PTX __shared__ memory (which is
private to each CUDA thread block).

We cannot easily adjust expansion of 'omp parallel' because it is done
before LTO streamout, so I've opted to adjust calls to GOMP_parallel in
pass_late_lower_omp instead.

As I see it, there are two possible approaches: either arrange for the
structure to be placed in shared memory from the compiler, or have
GOMP_parallel perform the copies.  The latter requires passing
sizeof(omp_data_o) to GOMP_parallel, and also to GOMP_OFFLOAD_run (to
reserve the shared memory), so doing it from the compiler seems simpler.

Using static storage may preclude nested parallelism.  I'm not sure we want
to support that for offloading anyway (but there needs to be a clear
decision).  Using separate variables is wasteful: they should go into a
union to reduce shared memory consumption.

A hand-written sketch of the intended transformation follows the patch.

	* omp-low.c (expand_parallel_call): Mark function for
	pass_late_lower_omp transforms.
	(pass_late_lower_omp::execute): Copy omp_data_o to/from 'shared'
	memory on NVPTX.
---
 gcc/omp-low.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 46 insertions(+), 7 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 6481163..5b75bf6 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -5384,7 +5384,10 @@ expand_parallel_call (struct omp_region *region, basic_block bb,
   if (t == NULL)
     t1 = null_pointer_node;
   else
-    t1 = build_fold_addr_expr (t);
+    {
+      t1 = build_fold_addr_expr (t);
+      cfun->curr_properties &= ~PROP_gimple_lompifn;
+    }
   t2 = build_fold_addr_expr (gimple_omp_parallel_child_fn (entry_stmt));
 
   vec_alloc (args, 4 + vec_safe_length (ws_args));
@@ -14703,15 +14706,51 @@ pass_late_lower_omp::execute (function *fun)
     for (i = gsi_start_bb (bb); !gsi_end_p (i); gsi_next (&i))
       {
         gimple stmt = gsi_stmt (i);
-        if (!(is_gimple_call (stmt)
-              && gimple_call_internal_p (stmt)
-              && gimple_call_internal_fn (stmt) == IFN_GOACC_DATA_END_WITH_ARG))
+
+        if (!is_gimple_call (stmt))
           continue;
 
-        tree fn = builtin_decl_explicit (BUILT_IN_GOACC_DATA_END);
-        gimple g = gimple_build_call (fn, 0);
+#ifdef ADDR_SPACE_SHARED
+        /* Transform "GOMP_parallel (fn, &omp_data_o, ...)" call to
+
+           static __shared__ typeof(omp_data_o) omp_data_shared;
+           omp_data_shared = omp_data_o;
+           GOMP_parallel(fn, &omp_data_shared, ...);
+           omp_data_o = omp_data_shared;  */
+        if (gimple_call_builtin_p (stmt, BUILT_IN_GOMP_PARALLEL))
+          {
+            tree omp_data_ptr = gimple_call_arg (stmt, 1);
+            if (TREE_CODE (omp_data_ptr) == ADDR_EXPR)
+              {
+                tree omp_data = TREE_OPERAND (omp_data_ptr, 0);
+                tree type = TREE_TYPE (omp_data);
+                int quals = ENCODE_QUAL_ADDR_SPACE (ADDR_SPACE_SHARED);
+                type = build_qualified_type (type, quals);
+                tree decl = create_tmp_var (type, "omp_data_shared");
+                TREE_STATIC (decl) = 1;
+                TREE_ADDRESSABLE (decl) = 1;
+                varpool_node::finalize_decl (decl);
+
+                gimple g = gimple_build_assign (decl, omp_data);
+                gsi_insert_before (&i, g, GSI_SAME_STMT);
+
+                g = gimple_build_assign (omp_data, decl);
+                gsi_insert_after (&i, g, GSI_NEW_STMT);
+
+                gimple_call_set_arg (stmt, 1, build_fold_addr_expr (decl));
+              }
+            continue;
+          }
+#endif
+
+        if (gimple_call_internal_p (stmt)
+            && gimple_call_internal_fn (stmt) == IFN_GOACC_DATA_END_WITH_ARG)
+          {
+            tree fn = builtin_decl_explicit (BUILT_IN_GOACC_DATA_END);
+            gimple g = gimple_build_call (fn, 0);
-        gsi_replace (&i, g, false);
+            gsi_replace (&i, g, false);
+          }
       }
 
   return TODO_update_ssa;
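
For concreteness, here is a hand-written, CUDA-flavoured sketch of what the
team master's code amounts to before and after the transform.  This is
illustration only, not output of the patch: 'struct omp_data_s', its 'x'
field and the dummy GOMP_parallel arguments are made-up stand-ins for the
outlined data record and the real call.

  /* Made-up stand-in for the outlined omp_data record.  */
  struct omp_data_s { int x; };

  /* Declaration only; stands for the device-side libgomp entry point.  */
  extern "C" __device__ void GOMP_parallel (void (*fn) (void *), void *data,
                                            unsigned num_threads,
                                            unsigned flags);

  __device__ void master_before (void (*fn) (void *))
  {
    /* Before the transform: omp_data_o sits in the master thread's
       private (.local) stack frame, so the address passed below is
       unusable by the other threads of the team.  */
    struct omp_data_s omp_data_o;
    omp_data_o.x = 42;
    GOMP_parallel (fn, &omp_data_o, 0, 0);
  }

  __device__ void master_after (void (*fn) (void *))
  {
    struct omp_data_s omp_data_o;
    omp_data_o.x = 42;

    /* After the transform: the static __shared__ copy is addressable by
       every thread in the CUDA thread block.  Copy in before the call and
       back out after it, so that updates other threads make through the
       pointer still reach omp_data_o.  */
    static __shared__ struct omp_data_s omp_data_shared;
    omp_data_shared = omp_data_o;
    GOMP_parallel (fn, &omp_data_shared, 0, 0);
    omp_data_o = omp_data_shared;
  }

Each transformed call site gets its own static omp_data_shared, which is
where the shared-memory waste mentioned above comes from; unioning them (or
having GOMP_parallel do the copies into a single reserved buffer) would
avoid that.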