[PATCH, OpenACC 2.7] Implement host_data must have use_device clause requirement
Hi Thomas, this patch implements the OpenACC 2.7 change requiring the host_data construct to have at least one use_device clause. This patch started out with a simple check during gimplify (much smaller patch), but turned out that front-ends removed use_device clauses when they have error, and the gimplify check started to echo a "no use_device clause" message in such cases, which seem confusing for the user. So ended up adding the check in each front-end instead. Tested on powerpc64le-linux/nvptx, x86_64-linux/amdgcn tests in progress (expect no surprises). Is this okay for trunk? Thanks, Chung-Lin gcc/c/ChangeLog: * c-parser.cc (c_parser_oacc_host_data): Add checking requiring OpenACC host_data construct to have an use_device clause. gcc/cp/ChangeLog: * parser.cc (cp_parser_oacc_host_data): Add checking requiring OpenACC host_data construct to have an use_device clause. gcc/fortran/ChangeLog: * trans-openmp.cc (gfc_trans_oacc_construct): Add checking requiring OpenACC host_data construct to have an use_device clause. gcc/testsuite/ChangeLog: * c-c++-common/goacc/host_data-2.c: Adjust testcase. * gfortran.dg/goacc/host_data-error.f90: New testcase. * gfortran.dg/goacc/pr71704.f90: Adjust testcase. From 0d17b8d24fa6079d6c289305e9644c3fecd429f1 Mon Sep 17 00:00:00 2001 From: Chung-Lin Tang Date: Tue, 6 Jun 2023 03:19:33 -0700 Subject: [PATCH 1/2] OpenACC 2.7: host_data must have use_device clause requirement This patch implements the OpenACC 2.7 change requiring the host_data construct to have at least one use_device clause. gcc/c/ChangeLog: * c-parser.cc (c_parser_oacc_host_data): Add checking requiring OpenACC host_data construct to have an use_device clause. gcc/cp/ChangeLog: * parser.cc (cp_parser_oacc_host_data): Add checking requiring OpenACC host_data construct to have an use_device clause. gcc/fortran/ChangeLog: * trans-openmp.cc (gfc_trans_oacc_construct): Add checking requiring OpenACC host_data construct to have an use_device clause. gcc/testsuite/ChangeLog: * c-c++-common/goacc/host_data-2.c: Adjust testcase. * gfortran.dg/goacc/host_data-error.f90: New testcase. * gfortran.dg/goacc/pr71704.f90: Adjust testcase. --- gcc/c/c-parser.cc | 9 +++-- gcc/cp/parser.cc| 11 +-- gcc/fortran/trans-openmp.cc | 6 ++ gcc/testsuite/c-c++-common/goacc/host_data-2.c | 7 ++- gcc/testsuite/gfortran.dg/goacc/host_data-error.f90 | 6 ++ gcc/testsuite/gfortran.dg/goacc/pr71704.f90 | 5 +++-- 6 files changed, 37 insertions(+), 7 deletions(-) create mode 100644 gcc/testsuite/gfortran.dg/goacc/host_data-error.f90 diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc index 5baa501dbee..b61aef8b1a2 100644 --- a/gcc/c/c-parser.cc +++ b/gcc/c/c-parser.cc @@ -18398,8 +18398,13 @@ c_parser_oacc_host_data (location_t loc, c_parser *parser, bool *if_p) tree stmt, clauses, block; clauses = c_parser_oacc_all_clauses (parser, OACC_HOST_DATA_CLAUSE_MASK, - "#pragma acc host_data"); - + "#pragma acc host_data", false); + if (!omp_find_clause (clauses, OMP_CLAUSE_USE_DEVICE_PTR)) +{ + error_at (loc, "% construct requires % clause"); + return error_mark_node; +} + clauses = c_finish_omp_clauses (clauses, C_ORT_ACC); block = c_begin_omp_parallel (); add_stmt (c_parser_omp_structured_block (parser, if_p)); stmt = c_finish_oacc_host_data (loc, clauses, block); diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc index 1c9aa671851..dd7638f1c93 100644 --- a/gcc/cp/parser.cc +++ b/gcc/cp/parser.cc @@ -45798,8 +45798,15 @@ cp_parser_oacc_host_data (cp_parser *parser, cp_token *pragma_tok, bool *if_p) unsigned int save; clauses = cp_parser_oacc_all_clauses (parser, OACC_HOST_DATA_CLAUSE_MASK, - "#pragma acc host_data", pragma_tok); - + "#pragma acc host_data", pragma_tok, + false); + if (!omp_find_clause (clauses, OMP_CLAUSE_USE_DEVICE_PTR)) +{ + error_at (pragma_tok->location, + "% construct requires % clause"); + return error_mark_node; +} + clauses = finish_omp_clauses (clauses, C_ORT_ACC); block = begin_omp_parallel (); save = cp_parser_begin_omp_structured_block (parser); cp_parser_statement (parser, NULL_TREE, false, if_p); diff --git a/gcc/fortran/trans-openmp.cc b/gcc/fortran/trans-openmp.cc index 42b608f3d36..5e0079cce76 100644 --- a/gcc/fortran/trans-openmp.cc +++ b/gcc/fortran/trans-openmp.cc @@ -4677,6 +4677,12 @@ gfc_trans_oacc_construct (gfc_code *code) break; case EXEC_OACC_HOST_DATA: construct_code = OACC_HOST_DATA; + if (code->ex
[PATCH, OpenACC 2.7] Implement default clause support for data constructs
Hi Thomas, this patch implements the OpenACC 2.7 addition of default(none|present) support for data constructs. Apart from adjusting the front-ends for allowed clauses masks (for acc data), mostly implemented in gimplify. Tested on powerpc64le-linux/nvptx, x86_64-linux/amdgcn tests in progress (expect no surprises). Is this okay for trunk? Thanks, Chung-Lin gcc/c/ChangeLog: * c-parser.cc (OACC_DATA_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_DEFAULT. gcc/cp/ChangeLog: * parser.cc (OACC_DATA_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_DEFAULT. gcc/fortran/ChangeLog: * openmp.cc (OACC_DATA_CLAUSES): Add OMP_CLAUSE_DEFAULT. gcc/ChangeLog: * gimplify.cc (struct gimplify_omp_ctx): Add oacc_data_default_kind field. (new_omp_context): Initialize oacc_data_default_kind field. (gimplify_scan_omp_clauses): Set oacc_data_default_kind for data constructs. Set ctx->default_kind for compute constructs from ctx->oacc_data_default_kind. gcc/testsuite/ChangeLog: * c-c++-common/goacc/default-3.c: Adjust testcase. * c-c++-common/goacc/default-5.c: Adjust testcase. * gfortran.dg/goacc/default-3.f95: Adjust testcase. * gfortran.dg/goacc/default-5.f: Adjust testcase. From 101305aee9b27c6df00d7c403e469bdf8d7f45a4 Mon Sep 17 00:00:00 2001 From: Chung-Lin Tang Date: Tue, 6 Jun 2023 03:46:29 -0700 Subject: [PATCH 2/2] OpenACC 2.7: default clause support for data constructs This patch implements the OpenACC 2.7 addition of default(none|present) support for data constructs. Now, specifying "default(none|present)" on a data construct turns on same default clause behavior for all enclosed compute constructs (which don't already themselves have a default clause). gcc/c/ChangeLog: * c-parser.cc (OACC_DATA_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_DEFAULT. gcc/cp/ChangeLog: * parser.cc (OACC_DATA_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_DEFAULT. gcc/fortran/ChangeLog: * openmp.cc (OACC_DATA_CLAUSES): Add OMP_CLAUSE_DEFAULT. gcc/ChangeLog: * gimplify.cc (struct gimplify_omp_ctx): Add oacc_data_default_kind field. (new_omp_context): Initialize oacc_data_default_kind field. (gimplify_scan_omp_clauses): Set oacc_data_default_kind for data constructs. Set ctx->default_kind for compute constructs from ctx->oacc_data_default_kind. gcc/testsuite/ChangeLog: * c-c++-common/goacc/default-3.c: Adjust testcase. * c-c++-common/goacc/default-5.c: Adjust testcase. * gfortran.dg/goacc/default-3.f95: Adjust testcase. * gfortran.dg/goacc/default-5.f: Adjust testcase. --- gcc/c/c-parser.cc | 1 + gcc/cp/parser.cc | 1 + gcc/fortran/openmp.cc | 3 ++- gcc/gimplify.cc | 20 +++ gcc/testsuite/c-c++-common/goacc/default-3.c | 15 +- gcc/testsuite/c-c++-common/goacc/default-5.c | 18 +++-- gcc/testsuite/gfortran.dg/goacc/default-3.f95 | 15 ++ gcc/testsuite/gfortran.dg/goacc/default-5.f | 17 ++-- 8 files changed, 84 insertions(+), 6 deletions(-) diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc index b61aef8b1a2..645d28b320d 100644 --- a/gcc/c/c-parser.cc +++ b/gcc/c/c-parser.cc @@ -18133,6 +18133,7 @@ c_parser_oacc_cache (location_t loc, c_parser *parser) | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYIN) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYOUT) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_CREATE) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEFAULT) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEVICEPTR) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_IF) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_NO_CREATE) \ diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc index dd7638f1c93..4b4df29a406 100644 --- a/gcc/cp/parser.cc +++ b/gcc/cp/parser.cc @@ -45759,6 +45759,7 @@ cp_parser_oacc_cache (cp_parser *parser, cp_token *pragma_tok) | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYIN) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYOUT) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_CREATE) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEFAULT) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DETACH) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEVICEPTR) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_IF) \ diff --git a/gcc/fortran/openmp.cc b/gcc/fortran/openmp.cc index 4c30548567f..b785e71f20f 100644 --- a/gcc/fortran/openmp.cc +++ b/gcc/fortran/openmp.cc @@ -3645,7 +3645,8 @@ error: #define OACC_DATA_CLAUSES \ (omp_mask (OMP_CLAUSE_IF) | OMP_CLAUSE_DEVICEPTR | OMP_CLAUSE_COPY\
[PATCH, OpenACC 2.7] Implement self clause for compute constructs
Hi Thomas, This patch implements the compiler side for the 'self' clause for compute constructs: parallel, kernels, and serial. As you know, the actual "local device" device type for libgomp is not yet implemented, so the libgomp side is basically just a simple duplicate of what host-fallback is doing, though everything else should be completed by this patch. Tested on powerpc64le-linux/nvptx, x64_64-linux/amdgcn tests pending. Is this okay for trunk? Thanks, Chung-Lin 2023-06-13 Chung-Lin Tang gcc/c/ChangeLog: * c-parser.cc (c_parser_oacc_compute_clause_self): New function. (c_parser_oacc_all_clauses): Add new 'bool compute_p = false' parameter, add parsing of self clause when compute_p is true. (OACC_KERNELS_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_SELF. (OACC_PARALLEL_CLAUSE_MASK): Likewise, (OACC_SERIAL_CLAUSE_MASK): Likewise. (c_parser_oacc_compute): Adjust call to c_parser_oacc_all_clauses to set compute_p argument to true. * c-typeck.cc (c_finish_omp_clauses): Add OMP_CLAUSE_SELF case. gcc/cp/ChangeLog: * parser.cc (cp_parser_oacc_compute_clause_self): New function. (cp_parser_oacc_all_clauses): Add new 'bool compute_p = false' parameter, add parsing of self clause when compute_p is true. (OACC_KERNELS_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_SELF. (OACC_PARALLEL_CLAUSE_MASK): Likewise, (OACC_SERIAL_CLAUSE_MASK): Likewise. (cp_parser_oacc_compute): Adjust call to c_parser_oacc_all_clauses to set compute_p argument to true. * pt.cc (tsubst_omp_clauses): Add OMP_CLAUSE_SELF case. * c-typeck.cc (c_finish_omp_clauses): Add OMP_CLAUSE_SELF case, merged with OMP_CLAUSE_IF case. gcc/fortran/ChangeLog: * gfortran.h (typedef struct gfc_omp_clauses): Add self_expr field. * openmp.cc (enum omp_mask2): Add OMP_CLAUSE_SELF. (gfc_match_omp_clauses): Add handling for OMP_CLAUSE_SELF. (OACC_PARALLEL_CLAUSES): Add OMP_CLAUSE_SELF. (OACC_KERNELS_CLAUSES): Likewise. (OACC_SERIAL_CLAUSES): Likewise. (resolve_omp_clauses): Add handling for omp_clauses->self_expr. * trans-openmp.cc (gfc_trans_omp_clauses): Add handling of clauses->self_expr and building of OMP_CLAUSE_SELF tree clause. (gfc_split_omp_clauses): Add handling of self_expr field copy. gcc/ChangeLog: * gimplify.cc (gimplify_scan_omp_clauses): Add OMP_CLAUSE_SELF case. (gimplify_adjust_omp_clauses): Likewise. * omp-expand.cc (expand_omp_target): Add OMP_CLAUSE_SELF expansion code, * omp-low.cc (scan_sharing_clauses): Add OMP_CLAUSE_SELF case. * tree-core.h (enum omp_clause_code): Add OMP_CLAUSE_SELF enum. * tree-nested.cc (convert_nonlocal_omp_clauses): Add OMP_CLAUSE_SELF case. (convert_local_omp_clauses): Likewise. * tree-pretty-print.cc (dump_omp_clause): Add OMP_CLAUSE_SELF case. * tree.cc (omp_clause_num_ops): Add OMP_CLAUSE_SELF entry. (omp_clause_code_name): Likewise. * tree.h (OMP_CLAUSE_SELF_EXPR): New macro. gcc/testsuite/ChangeLog: * c-c++-common/goacc/self-clause-1.c: New test. * c-c++-common/goacc/self-clause-2.c: New test. * gfortran.dg/goacc/self.f95: New test. include/ChangeLog: * gomp-constants.h (GOACC_FLAG_LOCAL_DEVICE): New flag bit value. libgomp/ChangeLog: * oacc-parallel.c (GOACC_parallel_keyed): Add code to handle GOACC_FLAG_LOCAL_DEVICE case. * testsuite/libgomp.oacc-c-c++-common/self-1.c: New test.From 449883981c8e1f707b47ff8f8dd70049b9ffda82 Mon Sep 17 00:00:00 2001 From: Chung-Lin Tang Date: Tue, 13 Jun 2023 08:44:31 -0700 Subject: [PATCH] OpenACC 2.7: Implement self clause for compute constructs This patch implements the 'self' clause for compute constructs: parallel, kernels, and serial. This clause conditionally uses the local device (the host mult-core CPU) as the executing device of the compute region. The actual implementation of the "local device" device type inside libgomp (presumably using pthreads) is still not yet completed, so the libgomp side is still implemented the exact same as host-fallback mode. (so as of now, it essentially behaves like the 'if' clause with the condition inverted) gcc/c/ChangeLog: * c-parser.cc (c_parser_oacc_compute_clause_self): New function. (c_parser_oacc_all_clauses): Add new 'bool compute_p = false' parameter, add parsing of self clause when compute_p is true. (OACC_KERNELS_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_SELF. (OACC_PARALLEL_CLAUSE_MASK): Likewise, (OACC_SERIAL_CLAUSE_MASK): Likewise. (c_parser_oacc_compute): Adjust call to c_parser_oacc_all_clauses to set compute_p argument to true. * c-typeck.cc (c_finish_omp_clauses): Add OMP_CLAUSE_SELF case. gcc/cp/ChangeLog: * parser.cc (cp_parser_oacc_compute_c
[PATCH, OpenACC 2.7, v2] Implement host_data must have use_device clause requirement
On 2023/6/16 5:13 PM, Thomas Schwinge wrote: > OK with one small change, please -- unless there's a reason for doing it > this way: > >> --- a/gcc/fortran/trans-openmp.cc >> +++ b/gcc/fortran/trans-openmp.cc >> @@ -4677,6 +4677,12 @@ gfc_trans_oacc_construct (gfc_code *code) >> break; >>case EXEC_OACC_HOST_DATA: >> construct_code = OACC_HOST_DATA; >> + if (code->ext.omp_clauses->lists[OMP_LIST_USE_DEVICE] == NULL) >> + { >> + error_at (gfc_get_location (&code->loc), >> + "% construct requires % >> clause"); >> + return NULL_TREE; >> + } >> break; >>default: >> gcc_unreachable (); > The OpenMP "must contain at least one [...] clause" checks are done in > 'gcc/fortran/openmp.cc:resolve_omp_clauses'. For consistency (or, to let > 'gcc/fortran/trans-openmp.cc' continue to just deal with "directive > translation"), do similar for OpenACC 'host_data'? (..., and we later > accordingly adjust 'gcc/fortran/openmp.cc:gfc_match_oacc_update', too?) Hi Thomas, I've adjusted the Fortran implementation as you described. Yes, I agree this way more fits current Fortran FE conventions. I've re-tested the attached v2 patch, will commit later this week if no major objections. Thanks, Chung-Lin gcc/c/ChangeLog: * c-parser.cc (c_parser_oacc_host_data): Add checking requiring OpenACC host_data construct to have an use_device clause. gcc/cp/ChangeLog: * parser.cc (cp_parser_oacc_host_data): Add checking requiring OpenACC host_data construct to have an use_device clause. gcc/fortran/ChangeLog: * openmp.cc (resolve_omp_clauses): Add checking requiring OpenACC host_data construct to have an use_device clause. gcc/testsuite/ChangeLog: * c-c++-common/goacc/host_data-2.c: Adjust testcase. * gfortran.dg/goacc/host_data-error.f90: New testcase. * gfortran.dg/goacc/pr71704.f90: Adjust testcase.diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc index 24a6eb6e459..80920b31f83 100644 --- a/gcc/c/c-parser.cc +++ b/gcc/c/c-parser.cc @@ -18461,8 +18461,13 @@ c_parser_oacc_host_data (location_t loc, c_parser *parser, bool *if_p) tree stmt, clauses, block; clauses = c_parser_oacc_all_clauses (parser, OACC_HOST_DATA_CLAUSE_MASK, - "#pragma acc host_data"); - + "#pragma acc host_data", false); + if (!omp_find_clause (clauses, OMP_CLAUSE_USE_DEVICE_PTR)) +{ + error_at (loc, "% construct requires % clause"); + return error_mark_node; +} + clauses = c_finish_omp_clauses (clauses, C_ORT_ACC); block = c_begin_omp_parallel (); add_stmt (c_parser_omp_structured_block (parser, if_p)); stmt = c_finish_oacc_host_data (loc, clauses, block); diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc index 5e2b5cba57e..beb5b632e5e 100644 --- a/gcc/cp/parser.cc +++ b/gcc/cp/parser.cc @@ -45895,8 +45895,15 @@ cp_parser_oacc_host_data (cp_parser *parser, cp_token *pragma_tok, bool *if_p) unsigned int save; clauses = cp_parser_oacc_all_clauses (parser, OACC_HOST_DATA_CLAUSE_MASK, - "#pragma acc host_data", pragma_tok); - + "#pragma acc host_data", pragma_tok, + false); + if (!omp_find_clause (clauses, OMP_CLAUSE_USE_DEVICE_PTR)) +{ + error_at (pragma_tok->location, + "% construct requires % clause"); + return error_mark_node; +} + clauses = finish_omp_clauses (clauses, C_ORT_ACC); block = begin_omp_parallel (); save = cp_parser_begin_omp_structured_block (parser); cp_parser_statement (parser, NULL_TREE, false, if_p); diff --git a/gcc/fortran/openmp.cc b/gcc/fortran/openmp.cc index 8efc4b3ecfa..f7af02845de 100644 --- a/gcc/fortran/openmp.cc +++ b/gcc/fortran/openmp.cc @@ -8764,6 +8764,12 @@ resolve_omp_clauses (gfc_code *code, gfc_omp_clauses *omp_clauses, "% clause", &omp_clauses->detach->where); } + if (openacc + && code->op == EXEC_OACC_HOST_DATA + && omp_clauses->lists[OMP_LIST_USE_DEVICE] == NULL) +gfc_error ("% construct at %L requires % clause", + &code->loc); + if (omp_clauses->assume) gfc_resolve_omp_assumptions (omp_clauses->assume); } diff --git a/gcc/testsuite/c-c++-common/goacc/host_data-2.c b/gcc/testsuite/c-c++-common/goacc/host_data-2.c index b3093e575ff..862a764eb3a 100644 --- a/gcc/testsuite/c-c++-common/goacc/host_data-2.c +++ b/gcc/testsuite/c-c++-common/goacc/host_data-2.c @@ -8,7 +8,9 @@ void f (void) { int v2 = 3; -#pragma acc host_data copy(v2) /* { dg-error ".copy. is not valid for ..pragma acc host_data." } */ +#pragma acc host_data copy(v2) + /* { dg-error ".copy. is not valid for ..pragma acc host_data." "" { target *-*-* } .-1 } */ + /* { dg-error ".host_data. construct requires .use_device. clause" "" { target *-*-* } .-2 } */
Re: [PATCH, OpenACC 2.7] Implement default clause support for data constructs
Hi Thomas, On 2023/6/23 6:47 PM, Thomas Schwinge wrote: >> + >>ctx->clauses = *orig_list_p; >>gimplify_omp_ctxp = ctx; >> } > Instead of this, in 'gimplify_omp_workshare', before the > 'gimplify_scan_omp_clauses' call, do something like: > > if ((ort & ORT_ACC) > && !omp_find_clause (OMP_CLAUSES (expr), OMP_CLAUSE_DEFAULT)) > { > /* Determine effective 'default' clause for OpenACC compute > construct. */ > for (struct gimplify_omp_ctx *ctx = gimplify_omp_ctxp; ctx; ctx = > ctx->outer_context) > { > if (ctx->region_type == ORT_ACC_DATA > && ctx->default_kind != OMP_CLAUSE_DEFAULT_SHARED) > { > [Append actual default clause on compute construct.] > break; > } > } > } > > That seems conceptually simpler to me? I'm not sure if this is conceptually simpler, but using 'oacc_default_kind' is definitely faster computationally :) However, as you mention below... > For the 'build_omp_clause', does using 'ctx->location' instead of > 'UNKNOWN_LOCATION' help diagnostics in any way? Like if we add in > 'gcc/gimplify.cc:oacc_default_clause', > 'if (ctx->default_kind == OMP_CLAUSE_DEFAULT_NONE)' another 'inform' to > point to the 'data' construct's 'default' clause? (But not sure if > that's easily done; otherwise don't.) Noticed that we will need to track the actually lexically enclosing OpenACC construct with the user set default-clause somewhere in 'ctx', in order to satisfy the current diagnostics in oacc_default_clause(). (the UNKNOWN_LOCATION for the internally created default-clause probably doesn't matter, that one is just for reminder in internal dumps, probably never plays role in user diagnostics) > Similar to the ones you've already got, please also add a few test cases > for nested 'default' clauses, like: > > #pragma acc data // no vs. 'default(none)' vs. 'default(present)' > { > #pragma acc data // no vs. same vs. different 'default' clause > { > #pragma acc data // no vs. same vs. different 'default' clause > { > #pragma acc parallel > > Similarly, test cases where 'default' on the compute construct overrides > 'default' of an outer 'data' construct. Okay, will add more testcases. Thanks, Chung-Lin
[PATCH, OpenACC 2.7] Connect readonly modifier to points-to analysis
On 2023/7/11 2:33 AM, Chung-Lin Tang via Gcc-patches wrote: > As we discussed earlier, the work for actually linking this to middle-end > points-to analysis is a somewhat non-trivial issue. This first patch allows > the language feature to be used in OpenACC directives first (with no effect > for now). > The middle-end changes are probably going to be a later patch. This second patch tries to link the readonly modifier to points-to analysis. There already exists SSA_NAME_POINTS_TO_READONLY_MEMORY and it's support in the alias oracle routines in tree-ssa-alias.cc, so basically what this patch does is try to make the variables holding the array section base pointers to have this flag set. There is an another OMP_CLAUSE_MAP_POINTS_TO_READONLY set by front-ends on the associated pointer clauses if OMP_CLAUSE_MAP_READONLY is set. Also a DECL_POINTS_TO_READONLY flag is set for VAR_DECLs when creating the tmp vars carrying these receiver references on the offloaded side. These eventually get translated to SSA_NAME_POINTS_TO_READONLY_MEMORY. This still doesn't always work as expected in terms of optimization: struct pointer fields and Fortran arrays (kind of like C structs) which have several accesses to create the pointer access on the receive/offloaded side, and SRA appears to not work on these sequences, so gets in the way of much redundancy elimination. Currently have one testcase where we can demonstrate 'readonly' can avoid a clobber by function call. Tested on powerpc64le-linux/nvptx. Note this patch is create a-top of the front-end patch. (will respond to the other front-end patch comments later) Thanks, Chung-Lin 2023-07-25 Chung-Lin Tang gcc/c/ChangeLog: * c-typeck.cc (handle_omp_array_sections): Set OMP_CLAUSE_MAP_POINTS_TO_READONLY on pointer clause. gcc/cp/ChangeLog: * semantics.cc (handle_omp_array_sections): Set OMP_CLAUSE_MAP_POINTS_TO_READONLY on pointer clause. gcc/fortran/ChangeLog: * trans-openmp.cc (gfc_trans_omp_array_section): Set OMP_CLAUSE_MAP_POINTS_TO_READONLY on pointer clause. gcc/ChangeLog: * gimple-expr.cc (copy_var_decl): Copy DECL_POINTS_TO_READONLY for VAR_DECLs. * gimplify.cc (struct gimplify_omp_ctx): Add 'hash_set *pt_readonly_ptrs' field. (internal_get_tmp_var): Set DECL_POINTS_TO_READONLY/SSA_NAME_POINTS_TO_READONLY_MEMORY for new temp vars. (build_omp_struct_comp_nodes): Set OMP_CLAUSE_MAP_POINTS_TO_READONLY on pointer clause. (gimplify_scan_omp_clauses): Collect OMP_CLAUSE_MAP_POINTS_TO_READONLY to ctx->pt_readonly_ptrs. * omp-low.cc (lower_omp_target): Set DECL_POINTS_TO_READONLY for variables of receiver refs. * tree-pretty-print.cc (dump_omp_clause): Print OMP_CLAUSE_MAP_POINTS_TO_READONLY. (dump_generic_node): Print SSA_NAME_POINTS_TO_READONLY_MEMORY. * tree.h (DECL_POINTS_TO_READONLY): New macro. (OMP_CLAUSE_MAP_POINTS_TO_READONLY): New macro. gcc/testsuite/ChangeLog: * c-c++-common/goacc/readonly-1.c: Adjust testcase. * c-c++-common/goacc/readonly-2.c: New testcase. * gfortran.dg/goacc/readonly-1.f90: Adjust testcase. diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc index 7cf411155c6..42591e4029a 100644 --- a/gcc/c/c-typeck.cc +++ b/gcc/c/c-typeck.cc @@ -14258,6 +14258,8 @@ handle_omp_array_sections (tree c, enum c_omp_region_type ort) OMP_CLAUSE_SET_MAP_KIND (c2, GOMP_MAP_ATTACH_DETACH); else OMP_CLAUSE_SET_MAP_KIND (c2, GOMP_MAP_FIRSTPRIVATE_POINTER); + if (OMP_CLAUSE_MAP_READONLY (c)) + OMP_CLAUSE_MAP_POINTS_TO_READONLY (c2) = 1; OMP_CLAUSE_MAP_IMPLICIT (c2) = OMP_CLAUSE_MAP_IMPLICIT (c); if (OMP_CLAUSE_MAP_KIND (c2) != GOMP_MAP_FIRSTPRIVATE_POINTER && !c_mark_addressable (t)) diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc index 8fb47fd179e..6ab467e1140 100644 --- a/gcc/cp/semantics.cc +++ b/gcc/cp/semantics.cc @@ -5872,6 +5872,8 @@ handle_omp_array_sections (tree c, enum c_omp_region_type ort) } else OMP_CLAUSE_SET_MAP_KIND (c2, GOMP_MAP_FIRSTPRIVATE_POINTER); + if (OMP_CLAUSE_MAP_READONLY (c)) + OMP_CLAUSE_MAP_POINTS_TO_READONLY (c2) = 1; OMP_CLAUSE_MAP_IMPLICIT (c2) = OMP_CLAUSE_MAP_IMPLICIT (c); if (OMP_CLAUSE_MAP_KIND (c2) != GOMP_MAP_FIRSTPRIVATE_POINTER && !cxx_mark_addressable (t)) diff --git a/gcc/fortran/trans-openmp.cc b/gcc/fortran/trans-openmp.cc index 2253d559f9c..d7cd65af1bb 100644 --- a/gcc/fortran/trans-openmp.cc +++ b/gcc/fortran/trans-openmp.cc @@ -2524,6 +2524,8 @@ gfc_trans_omp_array_section (stmtblock_t *block, gfc_exec_op op, node3 = build_omp_clause (input_location, OMP_CLAUSE_MAP); OMP_CLAUSE_SET_MAP_KIND (node3, ptr_kind); OMP_CLAUSE_DECL (node3)
[PATCH, OpenACC 2.7, v2] Implement default clause support for data constructs
Hi Thomas, this is v2 of the patch for implementing the OpenACC 2.7 addition of default(none|present) support for data constructs. Instead of propagating an additional 'oacc_default_kind' for OpenACC, this patch does it in a more complete way: it directly propagates the gimplify_omp_ctx* pointer of the inner most context where we found a default-clause. This supports displaying the location/type of OpenACC construct where the default-clause is in the error messages. The testcases also have the multiple nested data construct testing added, where we can now have messages referring precisely to the exact innermost default clause that was active at that program point. Note, I got rid of the dummy OMP_CLAUSE_DEFAULT creation in this version, since it seemed not really needed. Re-tested on master on powerpc64le-linux/nvptx. Okay to commit? Thanks, Chung-Lin 2023-08-01 Chung-Lin Tang gcc/c/ChangeLog: * c-parser.cc (OACC_DATA_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_DEFAULT. gcc/cp/ChangeLog: * parser.cc (OACC_DATA_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_DEFAULT. gcc/fortran/ChangeLog: * openmp.cc (OACC_DATA_CLAUSES): Add OMP_CLAUSE_DEFAULT. gcc/ChangeLog: * gimplify.cc (struct gimplify_omp_ctx): Add oacc_default_clause_ctx field. (new_omp_context): Initialize oacc_default_clause_ctx field. (oacc_region_type_name): New function. (oacc_default_clause): Lookup current default_kind value from ctx->oacc_default_clause_ctx, adjust default(none) error and inform message dumping. (gimplify_scan_omp_clauses): Upon OMP_CLAUSE_DEFAULT case, set ctx->oacc_default_clause_ctx to current context. gcc/testsuite/ChangeLog: * c-c++-common/goacc/default-3.c: Adjust testcase. * c-c++-common/goacc/default-5.c: Adjust testcase. * gfortran.dg/goacc/default-3.f95: Adjust testcase. * gfortran.dg/goacc/default-5.f: Adjust testcase.diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc index 24a6eb6e459..974f0132787 100644 --- a/gcc/c/c-parser.cc +++ b/gcc/c/c-parser.cc @@ -18196,6 +18196,7 @@ c_parser_oacc_cache (location_t loc, c_parser *parser) | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYIN) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYOUT) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_CREATE) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEFAULT) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEVICEPTR) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_IF) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_NO_CREATE) \ diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc index d7ef5b34d42..bc59fbeac20 100644 --- a/gcc/cp/parser.cc +++ b/gcc/cp/parser.cc @@ -45860,6 +45860,7 @@ cp_parser_oacc_cache (cp_parser *parser, cp_token *pragma_tok) | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYIN) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYOUT) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_CREATE) \ + | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEFAULT) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DETACH) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEVICEPTR) \ | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_IF) \ diff --git a/gcc/fortran/openmp.cc b/gcc/fortran/openmp.cc index 2952cd300ac..c37f843ec3b 100644 --- a/gcc/fortran/openmp.cc +++ b/gcc/fortran/openmp.cc @@ -3802,7 +3802,8 @@ error: #define OACC_DATA_CLAUSES \ (omp_mask (OMP_CLAUSE_IF) | OMP_CLAUSE_DEVICEPTR | OMP_CLAUSE_COPY\ | OMP_CLAUSE_COPYIN | OMP_CLAUSE_COPYOUT | OMP_CLAUSE_CREATE \ - | OMP_CLAUSE_NO_CREATE | OMP_CLAUSE_PRESENT | OMP_CLAUSE_ATTACH) + | OMP_CLAUSE_NO_CREATE | OMP_CLAUSE_PRESENT | OMP_CLAUSE_ATTACH \ + | OMP_CLAUSE_DEFAULT) #define OACC_LOOP_CLAUSES \ (omp_mask (OMP_CLAUSE_COLLAPSE) | OMP_CLAUSE_GANG | OMP_CLAUSE_WORKER \ | OMP_CLAUSE_VECTOR | OMP_CLAUSE_SEQ | OMP_CLAUSE_INDEPENDENT \ diff --git a/gcc/gimplify.cc b/gcc/gimplify.cc index 320920ed74c..ec0ccc67da8 100644 --- a/gcc/gimplify.cc +++ b/gcc/gimplify.cc @@ -225,6 +225,7 @@ struct gimplify_omp_ctx vec loop_iter_var; location_t location; enum omp_clause_default_kind default_kind; + struct gimplify_omp_ctx *oacc_default_clause_ctx; enum omp_region_type region_type; enum tree_code code; bool combined_loop; @@ -459,6 +460,10 @@ new_omp_context (enum omp_region_type region_type) c->default_kind = OMP_CLAUSE_DEFAULT_SHARED; else c->default_kind = OMP_CLAUSE_DEFAULT_UNSPECIFIED; + if (gimplify_omp_ctxp) +c->oacc_default_clause_ctx = gimplify_omp_ctxp->oacc_default_clause_ctx; + else +c->oacc_default_clause_ctx = c; c->defaultmap[GDMK_SCALAR] =
[PATCH, OpenACC 2.7, v2] readonly modifier support in front-ends
Hi Thomas, Tobias, here's the updated v2 of the readonly modifier front-end patch. On 2023/7/20 11:08 PM, Tobias Burnus wrote: >>> +++ b/gcc/c/c-parser.cc >>> @@ -14059,7 +14059,8 @@ c_parser_omp_variable_list (c_parser *parser, >>> >>> static tree >>> c_parser_omp_var_list_parens (c_parser *parser, enum omp_clause_code kind, >>> - tree list, bool allow_deref = false) >>> + tree list, bool allow_deref = false, >>> + bool *readonly = NULL) >>> ... >> Instead of doing this in 'c_parser_omp_var_list_parens', I think it's >> clearer to have this special 'readonly :' parsing logic in the two places >> where it's used. > I concur. The same issue also occurred for OpenMP's > c_parser_omp_clause_to, and c_parser_omp_clause_from and the 'present' > modifier. For it, I created a combined function but the main reason for > that is that OpenMP also permits more modifiers (like 'iterators'), > which would cause more duplication of code ('iterator' is not yet > supported). > > For something as simple to parse as this modifier, I would just do it at > the two places – as Thomas suggested. Okay, I've changed the C/C++ parser parts to have the parsing logic directly added. >>> +++ b/gcc/fortran/gfortran.h >>> @@ -1360,7 +1360,11 @@ typedef struct gfc_omp_namelist >>> { >>> gfc_omp_reduction_op reduction_op; >>> gfc_omp_depend_doacross_op depend_doacross_op; >>> - gfc_omp_map_op map_op; >>> + struct >>> +{ >>> + ENUM_BITFIELD (gfc_omp_map_op) map_op:8; >>> + bool readonly; >>> +}; >>> gfc_expr *align; >>> struct >>>{ >> [...] Thus, the above looks good to me. > I concur but I wonder whether it would be cleaner to name the struct; > this makes it also more obvious what belongs together in the union. > > Namely, naming the struct 'map' and then changing the 45 users from > 'u.map_op' to 'u.map.op' and the new 'u.readonly' to 'u.map.readonly'. – > this seems to be cleaner. I've adjusted 'u.map' to be a named struct now, and updated the references. >> + if (gfc_match ("readonly :") == MATCH_YES) >> I note this one does not have a space after ':' in 'gfc_match', but the >> one above in 'gfc_match_omp_clauses' does. I don't know off-hand if that >> makes a difference in parsing -- probably not, as all of >> 'gcc/fortran/openmp.cc' generally doesn't seem to be very consistent >> about these two variants? > It *does* make a difference. And for obvious reasons. You don't want to > permit: > >!$acc kernels asnyccopy(a) > > but require at least one space (or comma) between "async" and "copy".. > (In fixed form Fortran, it would be fine - as would be "!$acc k e nelsasy nc > co p y(a)".) > > A " " matches zero or more whitespaces, but with gfc_match_space you can find > out > whether there was whitespace or not. Okay, made sure both are 'gfc_match ("readonly : ")'. Thanks for catching that, didn't realize that space was significant. >>> +++ b/gcc/tree.h >>> @@ -1813,6 +1813,14 @@ class auto_suppress_location_wrappers >>> #define OMP_CLAUSE_MAP_DECL_MAKE_ADDRESSABLE(NODE) \ >>> (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_MAP)->base.addressable_flag) >>> >>> +/* Nonzero if OpenACC 'readonly' modifier set, used for 'copyin'. */ >>> +#define OMP_CLAUSE_MAP_READONLY(NODE) \ >>> + TREE_READONLY (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_MAP)) >>> + >>> +/* Same as above, for use in OpenACC cache directives. */ >>> +#define OMP_CLAUSE__CACHE__READONLY(NODE) \ >>> + TREE_READONLY (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE__CACHE_)) >> I'm not sure if these special accessor functions are actually useful, or >> we should just directly use 'TREE_READONLY' instead? We're only using >> them in contexts where it's clear that the 'OMP_CLAUSE_SUBCODE_CHECK' is >> satisfied, for example. > I find directly using TREE_READONLY confusing. FWIW, I've changed to use TREE_NOTHROW instead, if it can give a better sense of safety :P I think there's a misunderstanding here anyways: we are not relying on a DECL marked TREE_READONLY here. We merely need the OMP_CLAUSE_MAP to be marked as OMP_CLAUSE_MAP_READONLY == 1. The other points-to patch then (also in front-ends) take the OMP_CLAUSE_MAP_READONLY to mark the clauses of "base-pointers of array-sections" as OMP_CLAUSE_MAP_POINTS_TO_READONLY, and later this gradually gets relayed to alias oracle routines in tree-ssa-alias.cc Re-tested this v2 patch on powerpc64le-linux/nvptx. Okay for trunk? Thanks, Chung-Lin 2023-08-07 Chung-Lin Tang gcc/c/ChangeLog: * c-parser.cc (c_parser_oacc_data_clause): Add parsing support for 'readonly' modifier, set OMP_CLAUSE_MAP_READONLY if readonly modifier found, update comments. (c_parser_oacc_cache): Add parsing support for 'readonly' modifier, set OMP_CLAUSE__CACHE__READONLY if readonly modifier found, update comments.
[PATCH, OpenACC 2.7] Adjust acc_map_data/acc_unmap_data interaction with reference counters
Hi Thomas, This patch adjusts the implementation of acc_map_data/acc_unmap_data API library routines to more fit the description in the OpenACC 2.7 specification. Instead of using REFCOUNT_INFINITY, we now define a REFCOUNT_ACC_MAP_DATA special value to mark acc_map_data-created mappings, and allow adjustment of dynamic_refcount of such mappings by other constructs. Enforcing of an initial value of 1 for such mappings, and only allowing acc_unmap_data to delete such mappings, is implemented as specified. Actually, there is no real change (or improvement) in behavior of the API (thus no new tests) I've looked at the related OpenACC spec issues, and it seems that this part of the 2.7 spec change is mostly a clarification (see no downside in current REFCOUNT_INFINITY based implementation either). But this patch does make the internals more close to the spec description. Tested without regressions using powerpc64le-linux/nvptx, okay for trunk? Thanks, Chung-Lin 2023-06-22 Chung-Lin Tang libgomp/ChangeLog: * libgomp.h (REFCOUNT_ACC_MAP_DATA): Define as (REFCOUNT_SPECIAL | 2). * oacc-mem.c (acc_map_data): Adjust to use REFCOUNT_ACC_MAP_DATA, initialize dynamic_refcount as 1. (acc_unmap_data): Adjust to use REFCOUNT_ACC_MAP_DATA, (goacc_map_var_existing): Add REFCOUNT_ACC_MAP_DATA case. (goacc_exit_datum_1): Add REFCOUNT_ACC_MAP_DATA case, respect REFCOUNT_ACC_MAP_DATA when decrementing/finalizing. Force lowest dynamic_refcount to be 1 for REFCOUNT_ACC_MAP_DATA. * target.c (gomp_increment_refcount): Add REFCOUNT_ACC_MAP_DATA case. (gomp_decrement_refcount): Add REFCOUNT_ACC_MAP_DATA case, force lowest dynamic_refcount to be 1 for REFCOUNT_ACC_MAP_DATA. * testsuite/libgomp.oacc-c-c++-common/unmap-infinity-1.c: Adjust testcase error output scan test. diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index 4d2bfab4b71..fb8ef651dfb 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -1166,6 +1166,8 @@ struct target_mem_desc; /* Special value for refcount - tgt_offset contains target address of the artificial pointer to "omp declare target link" object. */ #define REFCOUNT_LINK (REFCOUNT_SPECIAL | 1) +/* Special value for refcount - created through acc_map_data. */ +#define REFCOUNT_ACC_MAP_DATA (REFCOUNT_SPECIAL | 2) /* Special value for refcount - structure element sibling list items. All such key refounts have REFCOUNT_STRUCTELEM bits set, with _FLAG_FIRST diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c index fe632740769..2a782ac22c1 100644 --- a/libgomp/oacc-mem.c +++ b/libgomp/oacc-mem.c @@ -411,7 +411,8 @@ acc_map_data (void *h, void *d, size_t s) assert (n->refcount == 1); assert (n->dynamic_refcount == 0); /* Special reference counting behavior. */ - n->refcount = REFCOUNT_INFINITY; + n->refcount = REFCOUNT_ACC_MAP_DATA; + n->dynamic_refcount = 1; if (profiling_p) { @@ -460,7 +461,7 @@ acc_unmap_data (void *h) the different 'REFCOUNT_INFINITY' cases, or simply separate 'REFCOUNT_INFINITY' values per different usage ('REFCOUNT_ACC_MAP_DATA' etc.)? */ - else if (n->refcount != REFCOUNT_INFINITY) + else if (n->refcount != REFCOUNT_ACC_MAP_DATA) { gomp_mutex_unlock (&acc_dev->lock); gomp_fatal ("refusing to unmap block [%p,+%d] that has not been mapped" @@ -519,7 +520,8 @@ goacc_map_var_existing (struct gomp_device_descr *acc_dev, void *hostaddr, } assert (n->refcount != REFCOUNT_LINK); - if (n->refcount != REFCOUNT_INFINITY) + if (n->refcount != REFCOUNT_INFINITY + && n->refcount != REFCOUNT_ACC_MAP_DATA) n->refcount++; n->dynamic_refcount++; @@ -683,6 +685,7 @@ goacc_exit_datum_1 (struct gomp_device_descr *acc_dev, void *h, size_t s, assert (n->refcount != REFCOUNT_LINK); if (n->refcount != REFCOUNT_INFINITY + && n->refcount != REFCOUNT_ACC_MAP_DATA && n->refcount < n->dynamic_refcount) { gomp_mutex_unlock (&acc_dev->lock); @@ -691,15 +694,27 @@ goacc_exit_datum_1 (struct gomp_device_descr *acc_dev, void *h, size_t s, if (finalize) { - if (n->refcount != REFCOUNT_INFINITY) + if (n->refcount != REFCOUNT_INFINITY + && n->refcount != REFCOUNT_ACC_MAP_DATA) n->refcount -= n->dynamic_refcount; - n->dynamic_refcount = 0; + + if (n->refcount == REFCOUNT_ACC_MAP_DATA) + /* Mappings created by acc_map_data are returned to initial + dynamic_refcount of 1. Can only be deleted by acc_unmap_data. */ + n->dynamic_refcount = 1; + else + n->dynamic_refcount = 0; } else if (n->dynamic_refcount) { - if (n->refcount != REFCOUNT_INFINITY) + if (n->refcount != REFCOUNT_INFINITY + && n->refcount != REFCOUNT_ACC_MAP_DATA) n->refcount--; - n->dynamic_refcount--; + + /* When mapping is created by acc_map_data, dynam
[PATCH, OpenACC 2.7] readonly modifier support in front-ends
Hi Thomas, this patch contains support for the 'readonly' modifier in copyin clauses and the cache directive. As we discussed earlier, the work for actually linking this to middle-end points-to analysis is a somewhat non-trivial issue. This first patch allows the language feature to be used in OpenACC directives first (with no effect for now). The middle-end changes are probably going to be a later patch. (Also CCing Tobias because of the Fortran bits) Tested on powerpc64le-linux with nvptx offloading. Is this okay for trunk? Thanks, Chung-Lin 2023-07-10 Chung-Lin Tang gcc/c/ChangeLog: * c-parser.cc (c_parser_omp_var_list_parens): Add 'bool *readonly = NULL' parameter, add readonly modifier parsing support. (c_parser_oacc_data_clause): Adjust c_parser_omp_var_list_parens call to turn on readonly modifier parsing for copyin clause, set OMP_CLAUSE_MAP_READONLY if readonly modifier found, update comments. (c_parser_oacc_cache): Adjust c_parser_omp_var_list_parens call to turn on readonly modifier parsing, set OMP_CLAUSE__CACHE__READONLY if readonly modifier found, update comments. gcc/cp/ChangeLog: * parser.cc (cp_parser_omp_var_list): Add 'bool *readonly = NULL' parameter, add readonly modifier parsing support. (cp_parser_oacc_data_clause): Adjust cp_parser_omp_var_list call to turn on readonly modifier parsing for copyin clause, set OMP_CLAUSE_MAP_READONLY if readonly modifier found, update comments. (cp_parser_oacc_cache): Adjust cp_parser_omp_var_list call to turn on readonly modifier parsing, set OMP_CLAUSE__CACHE__READONLY if readonly modifier found, update comments. gcc/fortran/ChangeLog: * gfortran.h (typedef struct gfc_omp_namelist): Adjust map_op as ENUM_BITFIELD field, add 'bool readonly' field. * openmp.cc (gfc_match_omp_map_clause): Add 'bool readonly = false' parameter, set n->u.readonly field. (gfc_match_omp_clauses): Add readonly modifier parsing for OpenACC copyin clause, adjust call to gfc_match_omp_map_clause. (gfc_match_oacc_cache): Add readonly modifier parsing for OpenACC cache directive, adjust call to gfc_match_omp_map_clause. * trans-openmp.cc (gfc_trans_omp_clauses): Set OMP_CLAUSE_MAP_READONLY, OMP_CLAUSE__CACHE__READONLY to 1 when readonly is set. gcc/ChangeLog: * tree-pretty-print.cc (dump_omp_clause): Add support for printing OMP_CLAUSE_MAP_READONLY and OMP_CLAUSE__CACHE__READONLY. * tree.h (OMP_CLAUSE_MAP_READONLY): New macro. (OMP_CLAUSE__CACHE__READONLY): New macro. gcc/testsuite/ChangeLog: * c-c++-common/goacc/readonly-1.c: New test. * gfortran.dg/goacc/readonly-1.f90: New test. diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc index d4b98d5d8b6..09e1e89d793 100644 --- a/gcc/c/c-parser.cc +++ b/gcc/c/c-parser.cc @@ -14059,7 +14059,8 @@ c_parser_omp_variable_list (c_parser *parser, static tree c_parser_omp_var_list_parens (c_parser *parser, enum omp_clause_code kind, - tree list, bool allow_deref = false) + tree list, bool allow_deref = false, + bool *readonly = NULL) { /* The clauses location. */ location_t loc = c_parser_peek_token (parser)->location; @@ -14067,6 +14068,20 @@ c_parser_omp_var_list_parens (c_parser *parser, enum omp_clause_code kind, matching_parens parens; if (parens.require_open (parser)) { + if (readonly != NULL) + { + c_token *token = c_parser_peek_token (parser); + if (token->type == CPP_NAME + && !strcmp (IDENTIFIER_POINTER (token->value), "readonly") + && c_parser_peek_2nd_token (parser)->type == CPP_COLON) + { + c_parser_consume_token (parser); + c_parser_consume_token (parser); + *readonly = true; + } + else + *readonly = false; + } list = c_parser_omp_variable_list (parser, loc, kind, list, allow_deref); parens.skip_until_found_close (parser); } @@ -14084,7 +14099,11 @@ c_parser_omp_var_list_parens (c_parser *parser, enum omp_clause_code kind, OpenACC 2.6: no_create ( variable-list ) attach ( variable-list ) - detach ( variable-list ) */ + detach ( variable-list ) + + OpenACC 2.7: + copyin (readonly : variable-list ) + */ static tree c_parser_oacc_data_clause (c_parser *parser, pragma_omp_clause c_kind, @@ -14135,11 +14154,22 @@ c_parser_oacc_data_clause (c_parser *parser, pragma_omp_clause c_kind, default: gcc_unreachable (); } + + /* Turn on readonly modifier parsing for copyin clause. */ + bool readonly = false, *readonly_ptr = NULL; + if (c_kind == PRAGMA_OACC_CLAUSE_COPYIN) +readonly_ptr = &readonly; + tree nl, c; - nl = c_parser_omp_var_list_parens (pars
[PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
Hi Tom, I had a patch submitted earlier, where I reported that the current way of implementing barriers in libgomp on nvptx created a quite significant performance drop on some SPEChpc2021 benchmarks: https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html That previous patch wasn't accepted well (admittedly, it was kind of a hack). So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX. Basically, instead of trying to have the GPU do CPU-with-OS-like things that it isn't suited for, barriers are implemented simplistically with bar.* synchronization instructions. Tasks are processed after threads have joined, and only if team->task_count != 0 (arguably, there might be a little bit of performance forfeited where earlier arriving threads could've been used to process tasks ahead of other threads. But that again falls into requiring implementing complex futex-wait/wake like behavior. Really, that kind of tasking is not what target offloading is usually used for) Implementation highlight notes: 1. gomp_team_barrier_wake() is now an empty function (threads never "wake" in the usual manner) 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction. 3. gomp_barrier_wait_last() now is implemented using "bar.arrive" 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end(): The main synchronization is done using a 'bar.red' instruction. This reduces across all threads the condition (team->task_count != 0), to enable the task processing down below if any thread created a task. (this bar.red usage required the need of the second GCC patch in this series) This patch has been tested on x86_64/powerpc64le with nvptx offloading, using libgomp, ovo, omptests, and sollve_vv testsuites, all without regressions. Also verified that the SPEChpc 2021 521.miniswp_t and 534.hpgmgfv_t performance regressions that occurred in the GCC12 cycle has been restored to devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk? (also suggest backporting to GCC12 branch, if performance regression can be considered a defect) Thanks, Chung-Lin libgomp/ChangeLog: 2022-09-21 Chung-Lin Tang * config/nvptx/bar.c (generation_to_barrier): Remove. (futex_wait,futex_wake,do_spin,do_wait): Remove. (GOMP_WAIT_H): Remove. (#include "../linux/bar.c"): Remove. (gomp_barrier_wait_end): New function. (gomp_barrier_wait): Likewise. (gomp_barrier_wait_last): Likewise. (gomp_team_barrier_wait_end): Likewise. (gomp_team_barrier_wait): Likewise. (gomp_team_barrier_wait_final): Likewise. (gomp_team_barrier_wait_cancel_end): Likewise. (gomp_team_barrier_wait_cancel): Likewise. (gomp_team_barrier_cancel): Likewise. * config/nvptx/bar.h (gomp_team_barrier_wake): Remove prototype, add new static inline function. diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c index eee2107..0b958ed 100644 --- a/libgomp/config/nvptx/bar.c +++ b/libgomp/config/nvptx/bar.c @@ -30,137 +30,143 @@ #include #include "libgomp.h" -/* For cpu_relax. */ -#include "doacross.h" - -/* Assuming ADDR is &bar->generation, return bar. Copied from - rtems/bar.c. */ +void +gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state) +{ + if (__builtin_expect (state & BAR_WAS_LAST, 0)) +{ + /* Next time we'll be awaiting TOTAL threads again. */ + bar->awaited = bar->total; + __atomic_store_n (&bar->generation, bar->generation + BAR_INCR, + MEMMODEL_RELEASE); +} + if (bar->total > 1) +asm ("bar.sync 1, %0;" : : "r" (32 * bar->total)); +} -static gomp_barrier_t * -generation_to_barrier (int *addr) +void +gomp_barrier_wait (gomp_barrier_t *bar) { - char *bar -= (char *) addr - __builtin_offsetof (gomp_barrier_t, generation); - return (gomp_barrier_t *)bar; + gomp_barrier_wait_end (bar, gomp_barrier_wait_start (bar)); } -/* Implement futex_wait-like behaviour to plug into the linux/bar.c - implementation. Assumes ADDR is &bar->generation. */ +/* Like gomp_barrier_wait, except that if the encountering thread + is not the last one to hit the barrier, it returns immediately. + The intended usage is that a thread which intends to gomp_barrier_destroy + this barrier calls gomp_barrier_wait, while all other threads + call gomp_barrier_wait_last. When gomp_barrier_wait returns, + the barrier can be safely destroyed. */ -static inline void -futex_wait (int *addr, int val) +void +gomp_barrier_wait_last (gomp_barrier_t *bar) { - gomp_barrier_t *bar = generation_to_barrier (addr); + /* The above described behavior matches 'bar.arrive' perfectly. */ + if (bar->total > 1) +asm ("bar.arrive 1, %0;" : : "r" (32 * bar->total)); +} - if (bar->total < 2) -/* A barrier with less than two threads, nop. */ -return; +void +gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_
[PATCH, nvptx, 2/2] Reimplement libgomp barriers for nvptx: bar.red instruction support in GCC
Hi Tom, following the first patch. This new barrier implementation I posted in the first patch uses the 'bar.red' instruction. Usually this could've been easily done with a single line of inline assembly. However I quickly realized that because the NVPTX GCC port is implemented with all virtual general registers, we don't have a register constraint usable to select "predicate registers". Since bar.red uses predicate typed values, I can't create it directly using inline asm. So it appears that the most simple way of accessing it is with a target builtin. The attached patch adds bar.red instructions to the nvptx port, and __builtin_nvptx_bar_red_* builtins to use it. The code should support all variations of bar.red (and, or, and popc operations). (This support was used to implement the first libgomp barrier patch, so must be approved together) Thanks, Chung-Lin 2022-09-21 Chung-Lin Tang gcc/ChangeLog: * config/nvptx/nvptx.cc (nvptx_print_operand): Add 'p' case, adjust comments. (enum nvptx_builtins): Add NVPTX_BUILTIN_BAR_RED_AND, NVPTX_BUILTIN_BAR_RED_OR, and NVPTX_BUILTIN_BAR_RED_POPC. (nvptx_expand_bar_red): New function. (nvptx_init_builtins): Add DEFs of __builtin_nvptx_bar_red_[and/or/popc]. (nvptx_expand_builtin): Use nvptx_expand_bar_red to expand NVPTX_BUILTIN_BAR_RED_[AND/OR/POPC] cases. * config/nvptx/nvptx.md (define_c_enum "unspecv"): Add UNSPECV_BARRED_AND, UNSPECV_BARRED_OR, and UNSPECV_BARRED_POPC. (BARRED): New int iterator. (barred_op,barred_mode,barred_ptxtype): New int attrs. (nvptx_barred_): New define_insn. diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc index 49cc681..afc3a890 100644 --- a/gcc/config/nvptx/nvptx.cc +++ b/gcc/config/nvptx/nvptx.cc @@ -2879,6 +2879,7 @@ nvptx_mem_maybe_shared_p (const_rtx x) t -- print a type opcode suffix, promoting QImode to 32 bits T -- print a type size in bits u -- print a type opcode suffix without promotions. + p -- print a '!' for constant 0. x -- print a destination operand that may also be a bit bucket. */ static void @@ -3012,6 +3013,11 @@ nvptx_print_operand (FILE *file, rtx x, int code) fprintf (file, "@!"); goto common; +case 'p': + if (INTVAL (x) == 0) + fprintf (file, "!"); + break; + case 'c': mode = GET_MODE (XEXP (x, 0)); switch (x_code) @@ -6151,9 +6157,90 @@ enum nvptx_builtins NVPTX_BUILTIN_CMP_SWAPLL, NVPTX_BUILTIN_MEMBAR_GL, NVPTX_BUILTIN_MEMBAR_CTA, + NVPTX_BUILTIN_BAR_RED_AND, + NVPTX_BUILTIN_BAR_RED_OR, + NVPTX_BUILTIN_BAR_RED_POPC, NVPTX_BUILTIN_MAX }; +/* Expander for 'bar.red' instruction builtins. */ + +static rtx +nvptx_expand_bar_red (tree exp, rtx target, + machine_mode ARG_UNUSED (m), int ARG_UNUSED (ignore)) +{ + int code = DECL_MD_FUNCTION_CODE (TREE_OPERAND (CALL_EXPR_FN (exp), 0)); + machine_mode mode = TYPE_MODE (TREE_TYPE (exp)); + + if (!target) +target = gen_reg_rtx (mode); + + rtx pred, dst; + rtx bar = expand_expr (CALL_EXPR_ARG (exp, 0), +NULL_RTX, SImode, EXPAND_NORMAL); + rtx nthr = expand_expr (CALL_EXPR_ARG (exp, 1), + NULL_RTX, SImode, EXPAND_NORMAL); + rtx cpl = expand_expr (CALL_EXPR_ARG (exp, 2), +NULL_RTX, SImode, EXPAND_NORMAL); + rtx redop = expand_expr (CALL_EXPR_ARG (exp, 3), + NULL_RTX, SImode, EXPAND_NORMAL); + if (CONST_INT_P (bar)) +{ + if (INTVAL (bar) < 0 || INTVAL (bar) > 15) + { + error_at (EXPR_LOCATION (exp), + "barrier value must be within [0,15]"); + return const0_rtx; + } +} + else if (!REG_P (bar)) +bar = copy_to_mode_reg (SImode, bar); + + if (!CONST_INT_P (nthr) && !REG_P (nthr)) +nthr = copy_to_mode_reg (SImode, nthr); + + if (!CONST_INT_P (cpl)) +{ + error_at (EXPR_LOCATION (exp), + "complement argument must be constant"); + return const0_rtx; +} + + pred = gen_reg_rtx (BImode); + if (!REG_P (redop)) +redop = copy_to_mode_reg (SImode, redop); + emit_insn (gen_rtx_SET (pred, gen_rtx_NE (BImode, redop, GEN_INT (0; + redop = pred; + + rtx pat; + switch (code) +{ +case NVPTX_BUILTIN_BAR_RED_AND: + dst = gen_reg_rtx (BImode); + pat = gen_nvptx_barred_and (dst, bar, nthr, cpl, redop); + break; +case NVPTX_BUILTIN_BAR_RED_OR: + dst = gen_reg_rtx (BImode); + pat = gen_nvptx_barred_or (dst, bar, nthr, cpl, redop); + break; +case NVPTX_BUILTIN_BAR_RED_POPC: + dst = gen_reg_rtx (SImode); + pat = gen_nvptx_barred_popc (dst, bar, nthr, cpl, redop); + break; +default: + gcc_unreachable (); +} + emit_insn (pat); + if (GET_MODE (dst) == BImode) +{ + rtx tmp = gen_reg_rtx (mode); + emit_insn (gen_rtx_SET (tmp, gen_rtx_NE (m
Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
On 2022/9/21 5:01 PM, Jakub Jelinek wrote: On Wed, Sep 21, 2022 at 03:45:36PM +0800, Chung-Lin Tang via Gcc-patches wrote: Hi Tom, I had a patch submitted earlier, where I reported that the current way of implementing barriers in libgomp on nvptx created a quite significant performance drop on some SPEChpc2021 benchmarks: https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html That previous patch wasn't accepted well (admittedly, it was kind of a hack). So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX. Basically, instead of trying to have the GPU do CPU-with-OS-like things that it isn't suited for, barriers are implemented simplistically with bar.* synchronization instructions. Tasks are processed after threads have joined, and only if team->task_count != 0 (arguably, there might be a little bit of performance forfeited where earlier arriving threads could've been used to process tasks ahead of other threads. But that again falls into requiring implementing complex futex-wait/wake like behavior. Really, that kind of tasking is not what target offloading is usually used for) I admit I don't have a good picture if people in real-world actually use tasking in offloading regions and how much and in what way, but the above definitely would be a show-stopper for typical tasking workloads, where one thread (usually from master/masked/single construct's body) creates lots of tasks and can spend considerable amount of time in those preparations, while other threads are expected to handle those tasks. I think the most common use case for target offloading is "parallel for". Really, not simply removing tasking altogether from target regions in the specification is just looking for trouble. If asynchronous offloaded tasks are to be supported, something at the whole GPU offload region level is much more reasonable, like the async clause functionality in OpenACC. Do we have an idea how are other implementations handling this? I think it should be easily observable with atomics, have master/masked/single that creates lots of tasks and then spends a long time doing something, have very small task bodies that just increment some atomic counter and at the end of the master/masked/single see how many tasks were already encountered. This could be an interesting test... Note, I don't have any smart ideas how to handle this instead and what you posted might be ok for what people usually do on offloading targets in OpenMP if they use tasking at all, just wanted to mention that there could be workloads where the above is a serious problem. If there are say hundreds of threads doing nothing until a single thread reaches a barrier and there are hundreds of pending tasks... I think it might still be doable, just not in the very fine "wake one thread" style that the Linux-based implementation was doing. E.g. note we have that 64 pending task limit after which we start to create undeferred tasks, so if we never start handling tasks until one thread is done with them, that would mean the single thread would create 64 deferred tasks and then handle all the others itself making it even longer until the other tasks can deal with it. Okay, thanks for reminding that. Chung-Lin
[PATCH, OG10, OpenMP 5.0, committed] Implement relaxation of implicit map vs. existing device mappings
This patch implements relaxing the requirements when a map with the implicit attribute encounters an overlapping existing map. As the OpenMP 5.0 spec describes on page 320, lines 18-27 (and 5.1 spec, page 352, lines 13-22): "If a single contiguous part of the original storage of a list item with an implicit data-mapping attribute has corresponding storage in the device data environment prior to a task encountering the construct that is associated with the map clause, only that part of the original storage will have corresponding storage in the device data environment as a result of the map clause." Also tracked in the OpenMP spec context as issue #1463: https://github.com/OpenMP/spec/issues/1463 The implementation inside the compiler is to of course, tag the implicitly created maps with some indication of "implicit". I've done this with a OMP_CLAUSE_MAP_IMPLICIT_P macro, using 'base.deprecated_flag' underneath. There is an encoding of this as GOMP_MAP_IMPLICIT == GOMP_MAP_FLAG_SPECIAL_3|GOMP_MAP_FLAG_SPECIAL_4 in include/gomp-constants.h for the runtime, but I've intentionally avoided exploding the entire gimplify/omp-low with a new set of GOMP_MAP_IMPLICIT_TO/FROM/etc. symbols, instead adding in the new flag bits only at the final runtime call generation during omp-lowering. The rest is libgomp mapping taking care of the implicit case: allowing map success if an existing map is a proper subset of the new map, if the new map is implicit. Straightforward enough I think. There are also some additions to print the implicit attribute during tree pretty-printing, for that reason some scan tests were updated. Also, another adjustment in this patch is how implicitly created clauses are added to the current clause list in gimplify_adjust_omp_clauses(). Instead of simply appending the new clauses to the end, this patch adds them at the position "after initial non-map clauses, but right before any existing map clauses". The reason for this is: when combined with other map clauses, for example: #pragma omp target map(rec.ptr[:N]) for (int i = 0; i < N; i++) rec.ptr[i] += 1; There will be an implicit map created for map(rec), because of the access inside the target region. The expectation is that 'rec' is implicitly mapped, and then the pointed array-section part by 'rec.ptr' will be mapped, and then attachment to the 'rec.ptr' field of the mapped 'rec' (in that order). If the implicit 'map(rec)' is appended to the end, instead of placed before other maps, the attachment operation will not find anything to attach to, and the entire region will fail. Note: this touches a bit on another issue which I will be sending a patch for later: per the discussion on omp-lang, an array section list item should *not* be mapping its base-pointer (although an attachment attempt should exist), while in current GCC behavior, for struct member pointers like 'rec.ptr' above, we do map it (which should be deemed incorrect). This means that as of right now, this modification of map order doesn't really exhibit the above mentioned behavior yet. I have included it as part of this patch because the "[implicit]" tree printing requires modifying many gimple scan tests already, so including the test modifications together seems more manageable patch-wise. Tested with no regressions, and pushed to devel/omp/gcc-10. Will be submitting a mainline trunk version later. Chung-Lin 2021-05-05 Chung-Lin Tang include/ChangeLog: * gomp-constants.h (GOMP_MAP_IMPLICIT): New special map kind bits value. (GOMP_MAP_FLAG_SPECIAL_BITS): Define helper mask for whole set of special map kind bits. (GOMP_MAP_NONCONTIG_ARRAY_P): Adjust test for non-contiguous array map kind bits to be more specific. (GOMP_MAP_IMPLICIT_P): New predicate macro for implicit map kinds. gcc/ChangeLog: * tree.h (OMP_CLAUSE_MAP_IMPLICIT_P): New access macro for 'implicit' bit, using 'base.deprecated_flag' field of tree_node. * tree-pretty-print.c (dump_omp_clause): Add support for printing implicit attribute in tree dumping. * gimplify.c (gimplify_adjust_omp_clauses_1): Set OMP_CLAUSE_MAP_IMPLICIT_P to 1 if map clause is implicitly created. (gimplify_adjust_omp_clauses): Adjust place of adding implicitly created clauses, from simple append, to starting of list, after non-map clauses. * omp-low.c (lower_omp_target): Add GOMP_MAP_IMPLICIT bits into kind values passed to libgomp for implicit maps. gcc/testsuite/ChangeLog: * c-c++-common/gomp/target-implicit-map-1.c: New test. * c-c++-common/goacc/combined-reduction.c: Adjust scan test pattern. * c-c++-common/goacc/firstprivate-mappings-1.c: Likewise. * c-c++-common/goacc/mdc-1.c: Likewise. * c-c++-common/goacc/reduction-1.c: Likewise. * c-c++-common/goacc/reduction-2.c: Likewise. * c-c++-common/goacc
[Ping x6] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
Ping x6 On 2022/12/6 12:21 AM, Chung-Lin Tang wrote: > Ping x5 > > On 2022/11/22 12:24 上午, Chung-Lin Tang wrote: >> Ping x4 >> >> On 2022/11/8 12:34 AM, Chung-Lin Tang wrote: >>> Ping x3. >>> >>> On 2022/10/31 10:18 PM, Chung-Lin Tang wrote: >>>> Ping x2. >>>> >>>> On 2022/10/17 10:29 PM, Chung-Lin Tang wrote: >>>>> Ping. >>>>> >>>>> On 2022/9/21 3:45 PM, Chung-Lin Tang via Gcc-patches wrote: >>>>>> Hi Tom, >>>>>> I had a patch submitted earlier, where I reported that the current way >>>>>> of implementing >>>>>> barriers in libgomp on nvptx created a quite significant performance >>>>>> drop on some SPEChpc2021 >>>>>> benchmarks: >>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html >>>>>> That previous patch wasn't accepted well (admittedly, it was kind of a >>>>>> hack). >>>>>> So in this patch, I tried to (mostly) re-implement team-barriers for >>>>>> NVPTX. >>>>>> >>>>>> Basically, instead of trying to have the GPU do CPU-with-OS-like things >>>>>> that it isn't suited for, >>>>>> barriers are implemented simplistically with bar.* synchronization >>>>>> instructions. >>>>>> Tasks are processed after threads have joined, and only if >>>>>> team->task_count != 0 >>>>>> >>>>>> (arguably, there might be a little bit of performance forfeited where >>>>>> earlier arriving threads >>>>>> could've been used to process tasks ahead of other threads. But that >>>>>> again falls into requiring >>>>>> implementing complex futex-wait/wake like behavior. Really, that kind of >>>>>> tasking is not what target >>>>>> offloading is usually used for) >>>>>> >>>>>> Implementation highlight notes: >>>>>> 1. gomp_team_barrier_wake() is now an empty function (threads never >>>>>> "wake" in the usual manner) >>>>>> 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction. >>>>>> 3. gomp_barrier_wait_last() now is implemented using "bar.arrive" >>>>>> >>>>>> 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end(): >>>>>> The main synchronization is done using a 'bar.red' instruction. This >>>>>> reduces across all threads >>>>>> the condition (team->task_count != 0), to enable the task processing >>>>>> down below if any thread >>>>>> created a task. (this bar.red usage required the need of the second >>>>>> GCC patch in this series) >>>>>> >>>>>> This patch has been tested on x86_64/powerpc64le with nvptx offloading, >>>>>> using libgomp, ovo, omptests, >>>>>> and sollve_vv testsuites, all without regressions. Also verified that >>>>>> the SPEChpc 2021 521.miniswp_t >>>>>> and 534.hpgmgfv_t performance regressions that occurred in the GCC12 >>>>>> cycle has been restored to >>>>>> devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk? >>>>>> >>>>>> (also suggest backporting to GCC12 branch, if performance regression can >>>>>> be considered a defect) >>>>>> >>>>>> Thanks, >>>>>> Chung-Lin >>>>>> >>>>>> libgomp/ChangeLog: >>>>>> >>>>>> 2022-09-21 Chung-Lin Tang >>>>>> >>>>>> * config/nvptx/bar.c (generation_to_barrier): Remove. >>>>>> (futex_wait,futex_wake,do_spin,do_wait): Remove. >>>>>> (GOMP_WAIT_H): Remove. >>>>>> (#include "../linux/bar.c"): Remove. >>>>>> (gomp_barrier_wait_end): New function. >>>>>> (gomp_barrier_wait): Likewise. >>>>>> (gomp_barrier_wait_last): Likewise. >>>>>> (gomp_team_barrier_wait_end): Likewise. >>>>>> (gomp_team_barrier_wait): Likewise. >>>>>> (gomp_team_barrier_wait_final): Likewise. >>>>>> (gomp_team_barrier_wait_cancel_end): Likewise. >>>>>> (gomp_team_barrier_wait_cancel): Likewise. >>>>>> (gomp_team_barrier_cancel): Likewise. >>>>>> * config/nvptx/bar.h (gomp_team_barrier_wake): Remove >>>>>> prototype, add new static inline function. >>> >> >
Re: nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case (was: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes)
Hi Thomas, On 2023/1/12 9:51 PM, Thomas Schwinge wrote: > In my case, 'cuda_callback_wrapper' (expectedly) gets invoked with > 'res != CUDA_SUCCESS' ("an illegal memory access was encountered"). > When we invoke 'GOMP_PLUGIN_fatal', this attempts to shut down the device > (..., which deadlocks); that's generally problematic: per > https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483 > "'cuStreamAddCallback' [...] Callbacks must not make any CUDA API calls". I remember running into this myself when first creating this async support (IIRC in my case it was cuFree()-ing something) yet you've found another mistake here! :) > Given that eventually we must reach a host/device synchronization point > (latest when the device is shut down at program termination), and the > non-'CUDA_SUCCESS' will be upheld until then, it does seem safe to > replace this 'GOMP_PLUGIN_fatal' with 'GOMP_PLUGIN_error' as per the > "nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case" > attached. OK to push? I think this patch is fine. Actual approval powers are your's or Tom's :) > > (Might we even skip 'GOMP_PLUGIN_error' here, understanding that the > error will be caught and reported at the next host/device synchronization > point? But I've not verified that.) Actually, the CUDA driver API docs are a bit vague on what exactly this CUresult arg to the callback actually means. The 'res != CUDA_SUCCESS' handling here was basically just generic handling. I am not really sure what is the true right thing to do here (is the error still retained by CUDA after the callback completes?) Chung-Lin
[Ping x3] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
Ping x3. On 2022/10/31 10:18 PM, Chung-Lin Tang wrote: > Ping x2. > > On 2022/10/17 10:29 PM, Chung-Lin Tang wrote: >> Ping. >> >> On 2022/9/21 3:45 PM, Chung-Lin Tang via Gcc-patches wrote: >>> Hi Tom, >>> I had a patch submitted earlier, where I reported that the current way of >>> implementing >>> barriers in libgomp on nvptx created a quite significant performance drop >>> on some SPEChpc2021 >>> benchmarks: >>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html >>> >>> That previous patch wasn't accepted well (admittedly, it was kind of a >>> hack). >>> So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX. >>> >>> Basically, instead of trying to have the GPU do CPU-with-OS-like things >>> that it isn't suited for, >>> barriers are implemented simplistically with bar.* synchronization >>> instructions. >>> Tasks are processed after threads have joined, and only if team->task_count >>> != 0 >>> >>> (arguably, there might be a little bit of performance forfeited where >>> earlier arriving threads >>> could've been used to process tasks ahead of other threads. But that again >>> falls into requiring >>> implementing complex futex-wait/wake like behavior. Really, that kind of >>> tasking is not what target >>> offloading is usually used for) >>> >>> Implementation highlight notes: >>> 1. gomp_team_barrier_wake() is now an empty function (threads never "wake" >>> in the usual manner) >>> 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction. >>> 3. gomp_barrier_wait_last() now is implemented using "bar.arrive" >>> >>> 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end(): >>> The main synchronization is done using a 'bar.red' instruction. This >>> reduces across all threads >>> the condition (team->task_count != 0), to enable the task processing >>> down below if any thread >>> created a task. (this bar.red usage required the need of the second GCC >>> patch in this series) >>> >>> This patch has been tested on x86_64/powerpc64le with nvptx offloading, >>> using libgomp, ovo, omptests, >>> and sollve_vv testsuites, all without regressions. Also verified that the >>> SPEChpc 2021 521.miniswp_t >>> and 534.hpgmgfv_t performance regressions that occurred in the GCC12 cycle >>> has been restored to >>> devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk? >>> >>> (also suggest backporting to GCC12 branch, if performance regression can be >>> considered a defect) >>> >>> Thanks, >>> Chung-Lin >>> >>> libgomp/ChangeLog: >>> >>> 2022-09-21 Chung-Lin Tang >>> >>> * config/nvptx/bar.c (generation_to_barrier): Remove. >>> (futex_wait,futex_wake,do_spin,do_wait): Remove. >>> (GOMP_WAIT_H): Remove. >>> (#include "../linux/bar.c"): Remove. >>> (gomp_barrier_wait_end): New function. >>> (gomp_barrier_wait): Likewise. >>> (gomp_barrier_wait_last): Likewise. >>> (gomp_team_barrier_wait_end): Likewise. >>> (gomp_team_barrier_wait): Likewise. >>> (gomp_team_barrier_wait_final): Likewise. >>> (gomp_team_barrier_wait_cancel_end): Likewise. >>> (gomp_team_barrier_wait_cancel): Likewise. >>> (gomp_team_barrier_cancel): Likewise. >>> * config/nvptx/bar.h (gomp_team_barrier_wake): Remove >>> prototype, add new static inline function.
[Ping x4] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
Ping x4 On 2022/11/8 12:34 AM, Chung-Lin Tang wrote: > Ping x3. > > On 2022/10/31 10:18 PM, Chung-Lin Tang wrote: >> Ping x2. >> >> On 2022/10/17 10:29 PM, Chung-Lin Tang wrote: >>> Ping. >>> >>> On 2022/9/21 3:45 PM, Chung-Lin Tang via Gcc-patches wrote: >>>> Hi Tom, >>>> I had a patch submitted earlier, where I reported that the current way of >>>> implementing >>>> barriers in libgomp on nvptx created a quite significant performance drop >>>> on some SPEChpc2021 >>>> benchmarks: >>>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html>>>>> >>>> That previous patch wasn't accepted well (admittedly, it was kind of a >>>> hack). >>>> So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX. >>>> >>>> Basically, instead of trying to have the GPU do CPU-with-OS-like things >>>> that it isn't suited for, >>>> barriers are implemented simplistically with bar.* synchronization >>>> instructions. >>>> Tasks are processed after threads have joined, and only if >>>> team->task_count != 0 >>>> >>>> (arguably, there might be a little bit of performance forfeited where >>>> earlier arriving threads >>>> could've been used to process tasks ahead of other threads. But that again >>>> falls into requiring >>>> implementing complex futex-wait/wake like behavior. Really, that kind of >>>> tasking is not what target >>>> offloading is usually used for) >>>> >>>> Implementation highlight notes: >>>> 1. gomp_team_barrier_wake() is now an empty function (threads never "wake" >>>> in the usual manner) >>>> 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction. >>>> 3. gomp_barrier_wait_last() now is implemented using "bar.arrive" >>>> >>>> 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end(): >>>> The main synchronization is done using a 'bar.red' instruction. This >>>> reduces across all threads >>>> the condition (team->task_count != 0), to enable the task processing >>>> down below if any thread >>>> created a task. (this bar.red usage required the need of the second >>>> GCC patch in this series) >>>> >>>> This patch has been tested on x86_64/powerpc64le with nvptx offloading, >>>> using libgomp, ovo, omptests, >>>> and sollve_vv testsuites, all without regressions. Also verified that the >>>> SPEChpc 2021 521.miniswp_t >>>> and 534.hpgmgfv_t performance regressions that occurred in the GCC12 cycle >>>> has been restored to >>>> devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk? >>>> >>>> (also suggest backporting to GCC12 branch, if performance regression can >>>> be considered a defect) >>>> >>>> Thanks, >>>> Chung-Lin >>>> >>>> libgomp/ChangeLog: >>>> >>>> 2022-09-21 Chung-Lin Tang >>>> >>>>* config/nvptx/bar.c (generation_to_barrier): Remove. >>>>(futex_wait,futex_wake,do_spin,do_wait): Remove. >>>>(GOMP_WAIT_H): Remove. >>>>(#include "../linux/bar.c"): Remove. >>>>(gomp_barrier_wait_end): New function. >>>>(gomp_barrier_wait): Likewise. >>>>(gomp_barrier_wait_last): Likewise. >>>>(gomp_team_barrier_wait_end): Likewise. >>>>(gomp_team_barrier_wait): Likewise. >>>>(gomp_team_barrier_wait_final): Likewise. >>>>(gomp_team_barrier_wait_cancel_end): Likewise. >>>>(gomp_team_barrier_wait_cancel): Likewise. >>>>(gomp_team_barrier_cancel): Likewise. >>>>* config/nvptx/bar.h (gomp_team_barrier_wake): Remove >>>>prototype, add new static inline function. >
[Ping x5] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
Ping x5 On 2022/11/22 12:24 上午, Chung-Lin Tang wrote: > Ping x4 > > On 2022/11/8 12:34 AM, Chung-Lin Tang wrote: >> Ping x3. >> >> On 2022/10/31 10:18 PM, Chung-Lin Tang wrote: >>> Ping x2. >>> >>> On 2022/10/17 10:29 PM, Chung-Lin Tang wrote: >>>> Ping. >>>> >>>> On 2022/9/21 3:45 PM, Chung-Lin Tang via Gcc-patches wrote: >>>>> Hi Tom, >>>>> I had a patch submitted earlier, where I reported that the current way of >>>>> implementing >>>>> barriers in libgomp on nvptx created a quite significant performance drop >>>>> on some SPEChpc2021 >>>>> benchmarks: >>>>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html>>>>>> >>>>> That previous patch wasn't accepted well (admittedly, it was kind of a >>>>> hack). >>>>> So in this patch, I tried to (mostly) re-implement team-barriers for >>>>> NVPTX. >>>>> >>>>> Basically, instead of trying to have the GPU do CPU-with-OS-like things >>>>> that it isn't suited for, >>>>> barriers are implemented simplistically with bar.* synchronization >>>>> instructions. >>>>> Tasks are processed after threads have joined, and only if >>>>> team->task_count != 0 >>>>> >>>>> (arguably, there might be a little bit of performance forfeited where >>>>> earlier arriving threads >>>>> could've been used to process tasks ahead of other threads. But that >>>>> again falls into requiring >>>>> implementing complex futex-wait/wake like behavior. Really, that kind of >>>>> tasking is not what target >>>>> offloading is usually used for) >>>>> >>>>> Implementation highlight notes: >>>>> 1. gomp_team_barrier_wake() is now an empty function (threads never >>>>> "wake" in the usual manner) >>>>> 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction. >>>>> 3. gomp_barrier_wait_last() now is implemented using "bar.arrive" >>>>> >>>>> 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end(): >>>>> The main synchronization is done using a 'bar.red' instruction. This >>>>> reduces across all threads >>>>> the condition (team->task_count != 0), to enable the task processing >>>>> down below if any thread >>>>> created a task. (this bar.red usage required the need of the second >>>>> GCC patch in this series) >>>>> >>>>> This patch has been tested on x86_64/powerpc64le with nvptx offloading, >>>>> using libgomp, ovo, omptests, >>>>> and sollve_vv testsuites, all without regressions. Also verified that the >>>>> SPEChpc 2021 521.miniswp_t >>>>> and 534.hpgmgfv_t performance regressions that occurred in the GCC12 >>>>> cycle has been restored to >>>>> devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk? >>>>> >>>>> (also suggest backporting to GCC12 branch, if performance regression can >>>>> be considered a defect) >>>>> >>>>> Thanks, >>>>> Chung-Lin >>>>> >>>>> libgomp/ChangeLog: >>>>> >>>>> 2022-09-21 Chung-Lin Tang >>>>> >>>>> * config/nvptx/bar.c (generation_to_barrier): Remove. >>>>> (futex_wait,futex_wake,do_spin,do_wait): Remove. >>>>> (GOMP_WAIT_H): Remove. >>>>> (#include "../linux/bar.c"): Remove. >>>>> (gomp_barrier_wait_end): New function. >>>>> (gomp_barrier_wait): Likewise. >>>>> (gomp_barrier_wait_last): Likewise. >>>>> (gomp_team_barrier_wait_end): Likewise. >>>>> (gomp_team_barrier_wait): Likewise. >>>>> (gomp_team_barrier_wait_final): Likewise. >>>>> (gomp_team_barrier_wait_cancel_end): Likewise. >>>>> (gomp_team_barrier_wait_cancel): Likewise. >>>>> (gomp_team_barrier_cancel): Likewise. >>>>> * config/nvptx/bar.h (gomp_team_barrier_wake): Remove >>>>> prototype, add new static inline function. >> >