[PATCH, OpenACC 2.7] Implement host_data must have use_device clause requirement

2023-06-06 Thread Chung-Lin Tang via Gcc-patches
Hi Thomas,
this patch implements the OpenACC 2.7 change requiring the host_data construct
to have at least one use_device clause.

This patch started out with a simple check during gimplify (much smaller patch),
but turned out that front-ends removed use_device clauses when they have error,
and the gimplify check started to echo a "no use_device clause" message in such
cases, which seem confusing for the user. So ended up adding the check in each
front-end instead.

Tested on powerpc64le-linux/nvptx, x86_64-linux/amdgcn tests in progress (expect
no surprises). Is this okay for trunk?

Thanks,
Chung-Lin

gcc/c/ChangeLog:

* c-parser.cc (c_parser_oacc_host_data): Add checking requiring OpenACC
host_data construct to have an use_device clause.

gcc/cp/ChangeLog:

* parser.cc (cp_parser_oacc_host_data): Add checking requiring OpenACC
host_data construct to have an use_device clause.

gcc/fortran/ChangeLog:

* trans-openmp.cc (gfc_trans_oacc_construct): Add checking requiring
OpenACC host_data construct to have an use_device clause.

gcc/testsuite/ChangeLog:

* c-c++-common/goacc/host_data-2.c: Adjust testcase.
* gfortran.dg/goacc/host_data-error.f90: New testcase.
* gfortran.dg/goacc/pr71704.f90: Adjust testcase.
From 0d17b8d24fa6079d6c289305e9644c3fecd429f1 Mon Sep 17 00:00:00 2001
From: Chung-Lin Tang 
Date: Tue, 6 Jun 2023 03:19:33 -0700
Subject: [PATCH 1/2] OpenACC 2.7: host_data must have use_device clause
 requirement

This patch implements the OpenACC 2.7 change requiring the host_data construct
to have at least one use_device clause.

gcc/c/ChangeLog:

* c-parser.cc (c_parser_oacc_host_data): Add checking requiring OpenACC
host_data construct to have an use_device clause.

gcc/cp/ChangeLog:

* parser.cc (cp_parser_oacc_host_data): Add checking requiring OpenACC
host_data construct to have an use_device clause.

gcc/fortran/ChangeLog:

* trans-openmp.cc (gfc_trans_oacc_construct): Add checking requiring
OpenACC host_data construct to have an use_device clause.

gcc/testsuite/ChangeLog:

* c-c++-common/goacc/host_data-2.c: Adjust testcase.
* gfortran.dg/goacc/host_data-error.f90: New testcase.
* gfortran.dg/goacc/pr71704.f90: Adjust testcase.
---
 gcc/c/c-parser.cc   |  9 +++--
 gcc/cp/parser.cc| 11 +--
 gcc/fortran/trans-openmp.cc |  6 ++
 gcc/testsuite/c-c++-common/goacc/host_data-2.c  |  7 ++-
 gcc/testsuite/gfortran.dg/goacc/host_data-error.f90 |  6 ++
 gcc/testsuite/gfortran.dg/goacc/pr71704.f90 |  5 +++--
 6 files changed, 37 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/host_data-error.f90

diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index 5baa501dbee..b61aef8b1a2 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -18398,8 +18398,13 @@ c_parser_oacc_host_data (location_t loc, c_parser 
*parser, bool *if_p)
   tree stmt, clauses, block;
 
   clauses = c_parser_oacc_all_clauses (parser, OACC_HOST_DATA_CLAUSE_MASK,
-  "#pragma acc host_data");
-
+  "#pragma acc host_data", false);
+  if (!omp_find_clause (clauses, OMP_CLAUSE_USE_DEVICE_PTR))
+{
+  error_at (loc, "% construct requires % clause");
+  return error_mark_node;
+}
+  clauses = c_finish_omp_clauses (clauses, C_ORT_ACC);
   block = c_begin_omp_parallel ();
   add_stmt (c_parser_omp_structured_block (parser, if_p));
   stmt = c_finish_oacc_host_data (loc, clauses, block);
diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc
index 1c9aa671851..dd7638f1c93 100644
--- a/gcc/cp/parser.cc
+++ b/gcc/cp/parser.cc
@@ -45798,8 +45798,15 @@ cp_parser_oacc_host_data (cp_parser *parser, cp_token 
*pragma_tok, bool *if_p)
   unsigned int save;
 
   clauses = cp_parser_oacc_all_clauses (parser, OACC_HOST_DATA_CLAUSE_MASK,
-   "#pragma acc host_data", pragma_tok);
-
+   "#pragma acc host_data", pragma_tok,
+   false);
+  if (!omp_find_clause (clauses, OMP_CLAUSE_USE_DEVICE_PTR))
+{
+  error_at (pragma_tok->location,
+   "% construct requires % clause");
+  return error_mark_node;
+}
+  clauses = finish_omp_clauses (clauses, C_ORT_ACC);
   block = begin_omp_parallel ();
   save = cp_parser_begin_omp_structured_block (parser);
   cp_parser_statement (parser, NULL_TREE, false, if_p);
diff --git a/gcc/fortran/trans-openmp.cc b/gcc/fortran/trans-openmp.cc
index 42b608f3d36..5e0079cce76 100644
--- a/gcc/fortran/trans-openmp.cc
+++ b/gcc/fortran/trans-openmp.cc
@@ -4677,6 +4677,12 @@ gfc_trans_oacc_construct (gfc_code *code)
break;
   case EXEC_OACC_HOST_DATA:
construct_code = OACC_HOST_DATA;
+   if (code->ex

[PATCH, OpenACC 2.7] Implement default clause support for data constructs

2023-06-06 Thread Chung-Lin Tang via Gcc-patches
Hi Thomas,
this patch implements the OpenACC 2.7 addition of default(none|present) support
for data constructs.

Apart from adjusting the front-ends for allowed clauses masks (for acc data),
mostly implemented in gimplify.

Tested on powerpc64le-linux/nvptx, x86_64-linux/amdgcn tests in progress (expect
no surprises). Is this okay for trunk?

Thanks,
Chung-Lin
gcc/c/ChangeLog:

* c-parser.cc (OACC_DATA_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_DEFAULT.

gcc/cp/ChangeLog:

* parser.cc (OACC_DATA_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_DEFAULT.

gcc/fortran/ChangeLog:

* openmp.cc (OACC_DATA_CLAUSES): Add OMP_CLAUSE_DEFAULT.

gcc/ChangeLog:

* gimplify.cc (struct gimplify_omp_ctx): Add oacc_data_default_kind
field.
(new_omp_context): Initialize oacc_data_default_kind field.
(gimplify_scan_omp_clauses): Set oacc_data_default_kind for data
constructs. Set ctx->default_kind for compute constructs from
ctx->oacc_data_default_kind.

gcc/testsuite/ChangeLog:

* c-c++-common/goacc/default-3.c: Adjust testcase.
* c-c++-common/goacc/default-5.c: Adjust testcase.
* gfortran.dg/goacc/default-3.f95: Adjust testcase.
* gfortran.dg/goacc/default-5.f: Adjust testcase.
From 101305aee9b27c6df00d7c403e469bdf8d7f45a4 Mon Sep 17 00:00:00 2001
From: Chung-Lin Tang 
Date: Tue, 6 Jun 2023 03:46:29 -0700
Subject: [PATCH 2/2] OpenACC 2.7: default clause support for data constructs

This patch implements the OpenACC 2.7 addition of default(none|present) support
for data constructs.

Now, specifying "default(none|present)" on a data construct turns on same
default clause behavior for all enclosed compute constructs (which don't
already themselves have a default clause).

gcc/c/ChangeLog:

* c-parser.cc (OACC_DATA_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_DEFAULT.

gcc/cp/ChangeLog:

* parser.cc (OACC_DATA_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_DEFAULT.

gcc/fortran/ChangeLog:

* openmp.cc (OACC_DATA_CLAUSES): Add OMP_CLAUSE_DEFAULT.

gcc/ChangeLog:

* gimplify.cc (struct gimplify_omp_ctx): Add oacc_data_default_kind
field.
(new_omp_context): Initialize oacc_data_default_kind field.
(gimplify_scan_omp_clauses): Set oacc_data_default_kind for data
constructs. Set ctx->default_kind for compute constructs from
ctx->oacc_data_default_kind.

gcc/testsuite/ChangeLog:

* c-c++-common/goacc/default-3.c: Adjust testcase.
* c-c++-common/goacc/default-5.c: Adjust testcase.
* gfortran.dg/goacc/default-3.f95: Adjust testcase.
* gfortran.dg/goacc/default-5.f: Adjust testcase.
---
 gcc/c/c-parser.cc |  1 +
 gcc/cp/parser.cc  |  1 +
 gcc/fortran/openmp.cc |  3 ++-
 gcc/gimplify.cc   | 20 +++
 gcc/testsuite/c-c++-common/goacc/default-3.c  | 15 +-
 gcc/testsuite/c-c++-common/goacc/default-5.c  | 18 +++--
 gcc/testsuite/gfortran.dg/goacc/default-3.f95 | 15 ++
 gcc/testsuite/gfortran.dg/goacc/default-5.f   | 17 ++--
 8 files changed, 84 insertions(+), 6 deletions(-)

diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index b61aef8b1a2..645d28b320d 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -18133,6 +18133,7 @@ c_parser_oacc_cache (location_t loc, c_parser *parser)
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYIN)  \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYOUT) \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_CREATE)  \
+   | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEFAULT) \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEVICEPTR)   \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_IF)  \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_NO_CREATE)   \
diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc
index dd7638f1c93..4b4df29a406 100644
--- a/gcc/cp/parser.cc
+++ b/gcc/cp/parser.cc
@@ -45759,6 +45759,7 @@ cp_parser_oacc_cache (cp_parser *parser, cp_token 
*pragma_tok)
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYIN)  \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYOUT) \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_CREATE)  \
+   | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEFAULT) \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DETACH)  \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEVICEPTR)   \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_IF)  \
diff --git a/gcc/fortran/openmp.cc b/gcc/fortran/openmp.cc
index 4c30548567f..b785e71f20f 100644
--- a/gcc/fortran/openmp.cc
+++ b/gcc/fortran/openmp.cc
@@ -3645,7 +3645,8 @@ error:
 #define OACC_DATA_CLAUSES \
   (omp_mask (OMP_CLAUSE_IF) | OMP_CLAUSE_DEVICEPTR  | OMP_CLAUSE_COPY\

[PATCH, OpenACC 2.7] Implement self clause for compute constructs

2023-06-13 Thread Chung-Lin Tang via Gcc-patches
Hi Thomas,
This patch implements the compiler side for the 'self' clause for compute 
constructs:
parallel, kernels, and serial.

As you know, the actual "local device" device type for libgomp is not yet 
implemented,
so the libgomp side is basically just a simple duplicate of what host-fallback 
is doing,
though everything else should be completed by this patch.

Tested on powerpc64le-linux/nvptx, x64_64-linux/amdgcn tests pending.
Is this okay for trunk?

Thanks,
Chung-Lin

2023-06-13  Chung-Lin Tang  

gcc/c/ChangeLog:

* c-parser.cc (c_parser_oacc_compute_clause_self): New function.
(c_parser_oacc_all_clauses): Add new 'bool compute_p = false'
parameter, add parsing of self clause when compute_p is true.
(OACC_KERNELS_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_SELF.
(OACC_PARALLEL_CLAUSE_MASK): Likewise,
(OACC_SERIAL_CLAUSE_MASK): Likewise.
(c_parser_oacc_compute): Adjust call to c_parser_oacc_all_clauses to
set compute_p argument to true.
* c-typeck.cc (c_finish_omp_clauses): Add OMP_CLAUSE_SELF case.

gcc/cp/ChangeLog:

* parser.cc (cp_parser_oacc_compute_clause_self): New function.
(cp_parser_oacc_all_clauses): Add new 'bool compute_p = false'
parameter, add parsing of self clause when compute_p is true.
(OACC_KERNELS_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_SELF.
(OACC_PARALLEL_CLAUSE_MASK): Likewise,
(OACC_SERIAL_CLAUSE_MASK): Likewise.
(cp_parser_oacc_compute): Adjust call to c_parser_oacc_all_clauses to
set compute_p argument to true.
* pt.cc (tsubst_omp_clauses): Add OMP_CLAUSE_SELF case.
* c-typeck.cc (c_finish_omp_clauses): Add OMP_CLAUSE_SELF case, merged
with OMP_CLAUSE_IF case.

gcc/fortran/ChangeLog:

* gfortran.h (typedef struct gfc_omp_clauses): Add self_expr field.
* openmp.cc (enum omp_mask2): Add OMP_CLAUSE_SELF.
(gfc_match_omp_clauses): Add handling for OMP_CLAUSE_SELF.
(OACC_PARALLEL_CLAUSES): Add OMP_CLAUSE_SELF.
(OACC_KERNELS_CLAUSES): Likewise.
(OACC_SERIAL_CLAUSES): Likewise.
(resolve_omp_clauses): Add handling for omp_clauses->self_expr.
* trans-openmp.cc (gfc_trans_omp_clauses): Add handling of
clauses->self_expr and building of OMP_CLAUSE_SELF tree clause.
(gfc_split_omp_clauses): Add handling of self_expr field copy.

gcc/ChangeLog:

* gimplify.cc (gimplify_scan_omp_clauses): Add OMP_CLAUSE_SELF case.
(gimplify_adjust_omp_clauses): Likewise.
* omp-expand.cc (expand_omp_target): Add OMP_CLAUSE_SELF expansion code,
* omp-low.cc (scan_sharing_clauses): Add OMP_CLAUSE_SELF case.
* tree-core.h (enum omp_clause_code): Add OMP_CLAUSE_SELF enum.
* tree-nested.cc (convert_nonlocal_omp_clauses): Add OMP_CLAUSE_SELF
case.
(convert_local_omp_clauses): Likewise.
* tree-pretty-print.cc (dump_omp_clause): Add OMP_CLAUSE_SELF case.
* tree.cc (omp_clause_num_ops): Add OMP_CLAUSE_SELF entry.
(omp_clause_code_name): Likewise.
* tree.h (OMP_CLAUSE_SELF_EXPR): New macro.

gcc/testsuite/ChangeLog:

* c-c++-common/goacc/self-clause-1.c: New test.
* c-c++-common/goacc/self-clause-2.c: New test.
* gfortran.dg/goacc/self.f95: New test.

include/ChangeLog:

* gomp-constants.h (GOACC_FLAG_LOCAL_DEVICE): New flag bit value.

libgomp/ChangeLog:

* oacc-parallel.c (GOACC_parallel_keyed): Add code to handle
GOACC_FLAG_LOCAL_DEVICE case.
* testsuite/libgomp.oacc-c-c++-common/self-1.c: New test.From 449883981c8e1f707b47ff8f8dd70049b9ffda82 Mon Sep 17 00:00:00 2001
From: Chung-Lin Tang 
Date: Tue, 13 Jun 2023 08:44:31 -0700
Subject: [PATCH] OpenACC 2.7: Implement self clause for compute constructs

This patch implements the 'self' clause for compute constructs: parallel,
kernels, and serial. This clause conditionally uses the local device
(the host mult-core CPU) as the executing device of the compute region.

The actual implementation of the "local device" device type inside libgomp
(presumably using pthreads) is still not yet completed, so the libgomp
side is still implemented the exact same as host-fallback mode. (so as of now,
it essentially behaves like the 'if' clause with the condition inverted)

gcc/c/ChangeLog:

* c-parser.cc (c_parser_oacc_compute_clause_self): New function.
(c_parser_oacc_all_clauses): Add new 'bool compute_p = false'
parameter, add parsing of self clause when compute_p is true.
(OACC_KERNELS_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_SELF.
(OACC_PARALLEL_CLAUSE_MASK): Likewise,
(OACC_SERIAL_CLAUSE_MASK): Likewise.
(c_parser_oacc_compute): Adjust call to c_parser_oacc_all_clauses to
set compute_p argument to true.
* c-typeck.cc (c_finish_omp_clauses): Add OMP_CLAUSE_SELF case.

gcc/cp/ChangeLog:

* parser.cc (cp_parser_oacc_compute_c

[PATCH, OpenACC 2.7, v2] Implement host_data must have use_device clause requirement

2023-07-13 Thread Chung-Lin Tang via Gcc-patches
On 2023/6/16 5:13 PM, Thomas Schwinge wrote:
> OK with one small change, please -- unless there's a reason for doing it
> this way:
> 
>> --- a/gcc/fortran/trans-openmp.cc
>> +++ b/gcc/fortran/trans-openmp.cc
>> @@ -4677,6 +4677,12 @@ gfc_trans_oacc_construct (gfc_code *code)
>>   break;
>>case EXEC_OACC_HOST_DATA:
>>   construct_code = OACC_HOST_DATA;
>> + if (code->ext.omp_clauses->lists[OMP_LIST_USE_DEVICE] == NULL)
>> +   {
>> + error_at (gfc_get_location (&code->loc),
>> +   "% construct requires % 
>> clause");
>> + return NULL_TREE;
>> +   }
>>   break;
>>default:
>>   gcc_unreachable ();
> The OpenMP "must contain at least one [...] clause" checks are done in
> 'gcc/fortran/openmp.cc:resolve_omp_clauses'.  For consistency (or, to let
> 'gcc/fortran/trans-openmp.cc' continue to just deal with "directive
> translation"), do similar for OpenACC 'host_data'?  (..., and we later
> accordingly adjust 'gcc/fortran/openmp.cc:gfc_match_oacc_update', too?)

Hi Thomas,
I've adjusted the Fortran implementation as you described. Yes, I agree this way
more fits current Fortran FE conventions.

I've re-tested the attached v2 patch, will commit later this week if no major
objections.

Thanks,
Chung-Lin

gcc/c/ChangeLog:

* c-parser.cc (c_parser_oacc_host_data): Add checking requiring OpenACC
host_data construct to have an use_device clause.

gcc/cp/ChangeLog:

* parser.cc (cp_parser_oacc_host_data): Add checking requiring OpenACC
host_data construct to have an use_device clause.

gcc/fortran/ChangeLog:

* openmp.cc (resolve_omp_clauses): Add checking requiring
OpenACC host_data construct to have an use_device clause.

gcc/testsuite/ChangeLog:

* c-c++-common/goacc/host_data-2.c: Adjust testcase.
* gfortran.dg/goacc/host_data-error.f90: New testcase.
* gfortran.dg/goacc/pr71704.f90: Adjust testcase.diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index 24a6eb6e459..80920b31f83 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -18461,8 +18461,13 @@ c_parser_oacc_host_data (location_t loc, c_parser 
*parser, bool *if_p)
   tree stmt, clauses, block;
 
   clauses = c_parser_oacc_all_clauses (parser, OACC_HOST_DATA_CLAUSE_MASK,
-  "#pragma acc host_data");
-
+  "#pragma acc host_data", false);
+  if (!omp_find_clause (clauses, OMP_CLAUSE_USE_DEVICE_PTR))
+{
+  error_at (loc, "% construct requires % clause");
+  return error_mark_node;
+}
+  clauses = c_finish_omp_clauses (clauses, C_ORT_ACC);
   block = c_begin_omp_parallel ();
   add_stmt (c_parser_omp_structured_block (parser, if_p));
   stmt = c_finish_oacc_host_data (loc, clauses, block);
diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc
index 5e2b5cba57e..beb5b632e5e 100644
--- a/gcc/cp/parser.cc
+++ b/gcc/cp/parser.cc
@@ -45895,8 +45895,15 @@ cp_parser_oacc_host_data (cp_parser *parser, cp_token 
*pragma_tok, bool *if_p)
   unsigned int save;
 
   clauses = cp_parser_oacc_all_clauses (parser, OACC_HOST_DATA_CLAUSE_MASK,
-   "#pragma acc host_data", pragma_tok);
-
+   "#pragma acc host_data", pragma_tok,
+   false);
+  if (!omp_find_clause (clauses, OMP_CLAUSE_USE_DEVICE_PTR))
+{
+  error_at (pragma_tok->location,
+   "% construct requires % clause");
+  return error_mark_node;
+}
+  clauses = finish_omp_clauses (clauses, C_ORT_ACC);
   block = begin_omp_parallel ();
   save = cp_parser_begin_omp_structured_block (parser);
   cp_parser_statement (parser, NULL_TREE, false, if_p);
diff --git a/gcc/fortran/openmp.cc b/gcc/fortran/openmp.cc
index 8efc4b3ecfa..f7af02845de 100644
--- a/gcc/fortran/openmp.cc
+++ b/gcc/fortran/openmp.cc
@@ -8764,6 +8764,12 @@ resolve_omp_clauses (gfc_code *code, gfc_omp_clauses 
*omp_clauses,
   "% clause", &omp_clauses->detach->where);
 }
 
+  if (openacc
+  && code->op == EXEC_OACC_HOST_DATA
+  && omp_clauses->lists[OMP_LIST_USE_DEVICE] == NULL)
+gfc_error ("% construct at %L requires % clause",
+  &code->loc);
+
   if (omp_clauses->assume)
 gfc_resolve_omp_assumptions (omp_clauses->assume);
 }
diff --git a/gcc/testsuite/c-c++-common/goacc/host_data-2.c 
b/gcc/testsuite/c-c++-common/goacc/host_data-2.c
index b3093e575ff..862a764eb3a 100644
--- a/gcc/testsuite/c-c++-common/goacc/host_data-2.c
+++ b/gcc/testsuite/c-c++-common/goacc/host_data-2.c
@@ -8,7 +8,9 @@ void
 f (void)
 {
   int v2 = 3;
-#pragma acc host_data copy(v2) /* { dg-error ".copy. is not valid for ..pragma 
acc host_data." } */
+#pragma acc host_data copy(v2)
+  /* { dg-error ".copy. is not valid for ..pragma acc host_data." "" { target 
*-*-* } .-1 } */
+  /* { dg-error ".host_data. construct requires .use_device. clause" "" { 
target *-*-* } .-2 } */

Re: [PATCH, OpenACC 2.7] Implement default clause support for data constructs

2023-07-14 Thread Chung-Lin Tang via Gcc-patches
Hi Thomas,

On 2023/6/23 6:47 PM, Thomas Schwinge wrote:
>> +
>>ctx->clauses = *orig_list_p;
>>gimplify_omp_ctxp = ctx;
>>  }
> Instead of this, in 'gimplify_omp_workshare', before the
> 'gimplify_scan_omp_clauses' call, do something like:
> 
> if ((ort & ORT_ACC)
> && !omp_find_clause (OMP_CLAUSES (expr), OMP_CLAUSE_DEFAULT))
>   {
> /* Determine effective 'default' clause for OpenACC compute 
> construct.  */
> for (struct gimplify_omp_ctx *ctx = gimplify_omp_ctxp; ctx; ctx = 
> ctx->outer_context)
>   {
> if (ctx->region_type == ORT_ACC_DATA
> && ctx->default_kind != OMP_CLAUSE_DEFAULT_SHARED)
>   {
> [Append actual default clause on compute construct.]
> break;
>   }
>   }
>   }
> 
> That seems conceptually simpler to me?

I'm not sure if this is conceptually simpler, but using 'oacc_default_kind'
is definitely faster computationally :)

However, as you mention below...

> For the 'build_omp_clause', does using 'ctx->location' instead of
> 'UNKNOWN_LOCATION' help diagnostics in any way?  Like if we add in
> 'gcc/gimplify.cc:oacc_default_clause',
> 'if (ctx->default_kind == OMP_CLAUSE_DEFAULT_NONE)' another 'inform' to
> point to the 'data' construct's 'default' clause?  (But not sure if
> that's easily done; otherwise don't.)

Noticed that we will need to track the actually lexically enclosing OpenACC 
construct
with the user set default-clause somewhere in 'ctx', in order to satisfy the 
current
diagnostics in oacc_default_clause().

(the UNKNOWN_LOCATION for the internally created default-clause probably doesn't
matter, that one is just for reminder in internal dumps, probably never plays 
role
in user diagnostics)

> Similar to the ones you've already got, please also add a few test cases
> for nested 'default' clauses, like:
> 
> #pragma acc data // no vs. 'default(none)' vs. 'default(present)'
> {
>   #pragma acc data // no vs. same vs. different 'default' clause
>   {
> #pragma acc data // no vs. same vs. different 'default' clause
> {
>   #pragma acc parallel
> 
> Similarly, test cases where 'default' on the compute construct overrides
> 'default' of an outer 'data' construct.

Okay, will add more testcases.

Thanks,
Chung-Lin


[PATCH, OpenACC 2.7] Connect readonly modifier to points-to analysis

2023-07-25 Thread Chung-Lin Tang via Gcc-patches
On 2023/7/11 2:33 AM, Chung-Lin Tang via Gcc-patches wrote:
> As we discussed earlier, the work for actually linking this to middle-end
> points-to analysis is a somewhat non-trivial issue. This first patch allows
> the language feature to be used in OpenACC directives first (with no effect 
> for now).
> The middle-end changes are probably going to be a later patch.

This second patch tries to link the readonly modifier to points-to analysis.

There already exists SSA_NAME_POINTS_TO_READONLY_MEMORY and it's support in the
alias oracle routines in tree-ssa-alias.cc, so basically what this patch does is
try to make the variables holding the array section base pointers to have this
flag set.

There is an another OMP_CLAUSE_MAP_POINTS_TO_READONLY set by front-ends on the
associated pointer clauses if OMP_CLAUSE_MAP_READONLY is set.
Also a DECL_POINTS_TO_READONLY flag is set for VAR_DECLs when creating the tmp
vars carrying these receiver references on the offloaded side. These
eventually get translated to SSA_NAME_POINTS_TO_READONLY_MEMORY.

This still doesn't always work as expected in terms of optimization:
struct pointer fields and Fortran arrays (kind of like C structs) which have
several accesses to create the pointer access on the receive/offloaded side,
and SRA appears to not work on these sequences, so gets in the way of much
redundancy elimination.

Currently have one testcase where we can demonstrate 'readonly' can avoid
a clobber by function call. Tested on powerpc64le-linux/nvptx.

Note this patch is create a-top of the front-end patch.
(will respond to the other front-end patch comments later)

Thanks,
Chung-Lin

2023-07-25  Chung-Lin Tang  

gcc/c/ChangeLog:

* c-typeck.cc (handle_omp_array_sections):
Set OMP_CLAUSE_MAP_POINTS_TO_READONLY on pointer clause.

gcc/cp/ChangeLog:

* semantics.cc (handle_omp_array_sections):
Set OMP_CLAUSE_MAP_POINTS_TO_READONLY on pointer clause.

gcc/fortran/ChangeLog:

* trans-openmp.cc (gfc_trans_omp_array_section):
Set OMP_CLAUSE_MAP_POINTS_TO_READONLY on pointer clause.

gcc/ChangeLog:

* gimple-expr.cc (copy_var_decl): Copy DECL_POINTS_TO_READONLY
for VAR_DECLs.
* gimplify.cc (struct gimplify_omp_ctx):
Add 'hash_set *pt_readonly_ptrs' field.
(internal_get_tmp_var): Set
DECL_POINTS_TO_READONLY/SSA_NAME_POINTS_TO_READONLY_MEMORY for
new temp vars.
(build_omp_struct_comp_nodes):
Set OMP_CLAUSE_MAP_POINTS_TO_READONLY on pointer clause.
(gimplify_scan_omp_clauses): Collect OMP_CLAUSE_MAP_POINTS_TO_READONLY
to ctx->pt_readonly_ptrs.
* omp-low.cc (lower_omp_target): Set DECL_POINTS_TO_READONLY for
variables of receiver refs.
* tree-pretty-print.cc (dump_omp_clause):
Print OMP_CLAUSE_MAP_POINTS_TO_READONLY.
(dump_generic_node): Print SSA_NAME_POINTS_TO_READONLY_MEMORY.
* tree.h (DECL_POINTS_TO_READONLY): New macro.
(OMP_CLAUSE_MAP_POINTS_TO_READONLY): New macro.

gcc/testsuite/ChangeLog:

* c-c++-common/goacc/readonly-1.c: Adjust testcase.
* c-c++-common/goacc/readonly-2.c: New testcase.
* gfortran.dg/goacc/readonly-1.f90: Adjust testcase.
diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc
index 7cf411155c6..42591e4029a 100644
--- a/gcc/c/c-typeck.cc
+++ b/gcc/c/c-typeck.cc
@@ -14258,6 +14258,8 @@ handle_omp_array_sections (tree c, enum 
c_omp_region_type ort)
OMP_CLAUSE_SET_MAP_KIND (c2, GOMP_MAP_ATTACH_DETACH);
   else
OMP_CLAUSE_SET_MAP_KIND (c2, GOMP_MAP_FIRSTPRIVATE_POINTER);
+  if (OMP_CLAUSE_MAP_READONLY (c))
+   OMP_CLAUSE_MAP_POINTS_TO_READONLY (c2) = 1;
   OMP_CLAUSE_MAP_IMPLICIT (c2) = OMP_CLAUSE_MAP_IMPLICIT (c);
   if (OMP_CLAUSE_MAP_KIND (c2) != GOMP_MAP_FIRSTPRIVATE_POINTER
  && !c_mark_addressable (t))
diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc
index 8fb47fd179e..6ab467e1140 100644
--- a/gcc/cp/semantics.cc
+++ b/gcc/cp/semantics.cc
@@ -5872,6 +5872,8 @@ handle_omp_array_sections (tree c, enum c_omp_region_type 
ort)
}
  else
OMP_CLAUSE_SET_MAP_KIND (c2, GOMP_MAP_FIRSTPRIVATE_POINTER);
+ if (OMP_CLAUSE_MAP_READONLY (c))
+   OMP_CLAUSE_MAP_POINTS_TO_READONLY (c2) = 1;
  OMP_CLAUSE_MAP_IMPLICIT (c2) = OMP_CLAUSE_MAP_IMPLICIT (c);
  if (OMP_CLAUSE_MAP_KIND (c2) != GOMP_MAP_FIRSTPRIVATE_POINTER
  && !cxx_mark_addressable (t))
diff --git a/gcc/fortran/trans-openmp.cc b/gcc/fortran/trans-openmp.cc
index 2253d559f9c..d7cd65af1bb 100644
--- a/gcc/fortran/trans-openmp.cc
+++ b/gcc/fortran/trans-openmp.cc
@@ -2524,6 +2524,8 @@ gfc_trans_omp_array_section (stmtblock_t *block, 
gfc_exec_op op,
   node3 = build_omp_clause (input_location, OMP_CLAUSE_MAP);
   OMP_CLAUSE_SET_MAP_KIND (node3, ptr_kind);
   OMP_CLAUSE_DECL (node3)

[PATCH, OpenACC 2.7, v2] Implement default clause support for data constructs

2023-08-01 Thread Chung-Lin Tang via Gcc-patches
Hi Thomas,
this is v2 of the patch for implementing the OpenACC 2.7 addition of
default(none|present) support for data constructs.

Instead of propagating an additional 'oacc_default_kind' for OpenACC,
this patch does it in a more complete way: it directly propagates the
gimplify_omp_ctx* pointer of the inner most context where we found
a default-clause. This supports displaying the location/type of OpenACC
construct where the default-clause is in the error messages.

The testcases also have the multiple nested data construct testing added,
where we can now have messages referring precisely to the exact innermost
default clause that was active at that program point.

Note, I got rid of the dummy OMP_CLAUSE_DEFAULT creation in this version,
since it seemed not really needed.

Re-tested on master on powerpc64le-linux/nvptx. Okay to commit?

Thanks,
Chung-Lin

2023-08-01  Chung-Lin Tang  

gcc/c/ChangeLog:
* c-parser.cc (OACC_DATA_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_DEFAULT.

gcc/cp/ChangeLog:
* parser.cc (OACC_DATA_CLAUSE_MASK): Add PRAGMA_OACC_CLAUSE_DEFAULT.

gcc/fortran/ChangeLog:
* openmp.cc (OACC_DATA_CLAUSES): Add OMP_CLAUSE_DEFAULT.

gcc/ChangeLog:
* gimplify.cc (struct gimplify_omp_ctx): Add oacc_default_clause_ctx
field.
(new_omp_context): Initialize oacc_default_clause_ctx field.
(oacc_region_type_name): New function.
(oacc_default_clause): Lookup current default_kind value from
ctx->oacc_default_clause_ctx, adjust default(none) error and inform
message dumping.
(gimplify_scan_omp_clauses): Upon OMP_CLAUSE_DEFAULT case, set
ctx->oacc_default_clause_ctx to current context.

gcc/testsuite/ChangeLog:
* c-c++-common/goacc/default-3.c: Adjust testcase.
* c-c++-common/goacc/default-5.c: Adjust testcase.
* gfortran.dg/goacc/default-3.f95: Adjust testcase.
* gfortran.dg/goacc/default-5.f: Adjust testcase.diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index 24a6eb6e459..974f0132787 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -18196,6 +18196,7 @@ c_parser_oacc_cache (location_t loc, c_parser *parser)
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYIN)  \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYOUT) \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_CREATE)  \
+   | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEFAULT) \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEVICEPTR)   \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_IF)  \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_NO_CREATE)   \
diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc
index d7ef5b34d42..bc59fbeac20 100644
--- a/gcc/cp/parser.cc
+++ b/gcc/cp/parser.cc
@@ -45860,6 +45860,7 @@ cp_parser_oacc_cache (cp_parser *parser, cp_token 
*pragma_tok)
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYIN)  \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COPYOUT) \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_CREATE)  \
+   | (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEFAULT) \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DETACH)  \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_DEVICEPTR)   \
| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_IF)  \
diff --git a/gcc/fortran/openmp.cc b/gcc/fortran/openmp.cc
index 2952cd300ac..c37f843ec3b 100644
--- a/gcc/fortran/openmp.cc
+++ b/gcc/fortran/openmp.cc
@@ -3802,7 +3802,8 @@ error:
 #define OACC_DATA_CLAUSES \
   (omp_mask (OMP_CLAUSE_IF) | OMP_CLAUSE_DEVICEPTR  | OMP_CLAUSE_COPY\
| OMP_CLAUSE_COPYIN | OMP_CLAUSE_COPYOUT | OMP_CLAUSE_CREATE
  \
-   | OMP_CLAUSE_NO_CREATE | OMP_CLAUSE_PRESENT | OMP_CLAUSE_ATTACH)
+   | OMP_CLAUSE_NO_CREATE | OMP_CLAUSE_PRESENT | OMP_CLAUSE_ATTACH   \
+   | OMP_CLAUSE_DEFAULT)
 #define OACC_LOOP_CLAUSES \
   (omp_mask (OMP_CLAUSE_COLLAPSE) | OMP_CLAUSE_GANG | OMP_CLAUSE_WORKER
  \
| OMP_CLAUSE_VECTOR | OMP_CLAUSE_SEQ | OMP_CLAUSE_INDEPENDENT \
diff --git a/gcc/gimplify.cc b/gcc/gimplify.cc
index 320920ed74c..ec0ccc67da8 100644
--- a/gcc/gimplify.cc
+++ b/gcc/gimplify.cc
@@ -225,6 +225,7 @@ struct gimplify_omp_ctx
   vec loop_iter_var;
   location_t location;
   enum omp_clause_default_kind default_kind;
+  struct gimplify_omp_ctx *oacc_default_clause_ctx;
   enum omp_region_type region_type;
   enum tree_code code;
   bool combined_loop;
@@ -459,6 +460,10 @@ new_omp_context (enum omp_region_type region_type)
 c->default_kind = OMP_CLAUSE_DEFAULT_SHARED;
   else
 c->default_kind = OMP_CLAUSE_DEFAULT_UNSPECIFIED;
+  if (gimplify_omp_ctxp)
+c->oacc_default_clause_ctx = gimplify_omp_ctxp->oacc_default_clause_ctx;
+  else
+c->oacc_default_clause_ctx = c;
   c->defaultmap[GDMK_SCALAR] = 

[PATCH, OpenACC 2.7, v2] readonly modifier support in front-ends

2023-08-07 Thread Chung-Lin Tang via Gcc-patches
Hi Thomas, Tobias,
here's the updated v2 of the readonly modifier front-end patch.

On 2023/7/20 11:08 PM, Tobias Burnus wrote:
>>> +++ b/gcc/c/c-parser.cc
>>> @@ -14059,7 +14059,8 @@ c_parser_omp_variable_list (c_parser *parser,
>>>
>>>   static tree
>>>   c_parser_omp_var_list_parens (c_parser *parser, enum omp_clause_code kind,
>>> -   tree list, bool allow_deref = false)
>>> +   tree list, bool allow_deref = false,
>>> +   bool *readonly = NULL)
>>> ...
>> Instead of doing this in 'c_parser_omp_var_list_parens', I think it's
>> clearer to have this special 'readonly :' parsing logic in the two places
>> where it's used.
> I concur. The same issue also occurred for OpenMP's
> c_parser_omp_clause_to, and c_parser_omp_clause_from and the 'present'
> modifier. For it, I created a combined function but the main reason for
> that is that OpenMP also permits more modifiers (like 'iterators'),
> which would cause more duplication of code ('iterator' is not yet
> supported).
> 
> For something as simple to parse as this modifier, I would just do it at
> the two places – as Thomas suggested.

Okay, I've changed the C/C++ parser parts to have the parsing logic directly
added.

>>> +++ b/gcc/fortran/gfortran.h
>>> @@ -1360,7 +1360,11 @@ typedef struct gfc_omp_namelist
>>>   {
>>> gfc_omp_reduction_op reduction_op;
>>> gfc_omp_depend_doacross_op depend_doacross_op;
>>> -  gfc_omp_map_op map_op;
>>> +  struct
>>> +{
>>> +   ENUM_BITFIELD (gfc_omp_map_op) map_op:8;
>>> +   bool readonly;
>>> +};
>>> gfc_expr *align;
>>> struct
>>>{
>> [...] Thus, the above looks good to me.
> I concur but I wonder whether it would be cleaner to name the struct;
> this makes it also more obvious what belongs together in the union.
> 
> Namely, naming the struct 'map' and then changing the 45 users from
> 'u.map_op' to 'u.map.op' and the new 'u.readonly' to 'u.map.readonly'. –
> this seems to be cleaner.

I've adjusted 'u.map' to be a named struct now, and updated the references.

>> + if (gfc_match ("readonly :") == MATCH_YES)
>> I note this one does not have a space after ':' in 'gfc_match', but the
>> one above in 'gfc_match_omp_clauses' does.  I don't know off-hand if that
>> makes a difference in parsing -- probably not, as all of
>> 'gcc/fortran/openmp.cc' generally doesn't seem to be very consistent
>> about these two variants?
> It *does* make a difference. And for obvious reasons. You don't want to 
> permit:
> 
>!$acc kernels asnyccopy(a)
> 
> but require at least one space (or comma) between "async" and "copy"..
> (In fixed form Fortran, it would be fine - as would be "!$acc k e nelsasy nc 
> co p y(a)".)
> 
> A " " matches zero or more whitespaces, but with gfc_match_space you can find 
> out
> whether there was whitespace or not.

Okay, made sure both are 'gfc_match ("readonly : ")'. Thanks for catching that, 
didn't
realize that space was significant.

>>> +++ b/gcc/tree.h
>>> @@ -1813,6 +1813,14 @@ class auto_suppress_location_wrappers
>>>   #define OMP_CLAUSE_MAP_DECL_MAKE_ADDRESSABLE(NODE) \
>>> (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_MAP)->base.addressable_flag)
>>>
>>> +/* Nonzero if OpenACC 'readonly' modifier set, used for 'copyin'.  */
>>> +#define OMP_CLAUSE_MAP_READONLY(NODE) \
>>> +  TREE_READONLY (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_MAP))
>>> +
>>> +/* Same as above, for use in OpenACC cache directives.  */
>>> +#define OMP_CLAUSE__CACHE__READONLY(NODE) \
>>> +  TREE_READONLY (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE__CACHE_))
>> I'm not sure if these special accessor functions are actually useful, or
>> we should just directly use 'TREE_READONLY' instead?  We're only using
>> them in contexts where it's clear that the 'OMP_CLAUSE_SUBCODE_CHECK' is
>> satisfied, for example.
> I find directly using TREE_READONLY confusing.

FWIW, I've changed to use TREE_NOTHROW instead, if it can give a better sense 
of safety :P

I think there's a misunderstanding here anyways: we are not relying on a DECL 
marked
TREE_READONLY here. We merely need the OMP_CLAUSE_MAP to be marked as 
OMP_CLAUSE_MAP_READONLY == 1.

The other points-to patch then (also in front-ends) take the 
OMP_CLAUSE_MAP_READONLY
to mark the clauses of "base-pointers of array-sections" as 
OMP_CLAUSE_MAP_POINTS_TO_READONLY,
and later this gradually gets relayed to alias oracle routines in 
tree-ssa-alias.cc

Re-tested this v2 patch on powerpc64le-linux/nvptx. Okay for trunk?

Thanks,
Chung-Lin

2023-08-07  Chung-Lin Tang  

gcc/c/ChangeLog:

* c-parser.cc (c_parser_oacc_data_clause): Add parsing support for
'readonly' modifier, set OMP_CLAUSE_MAP_READONLY if readonly modifier
found, update comments.
(c_parser_oacc_cache): Add parsing support for 'readonly' modifier,
set OMP_CLAUSE__CACHE__READONLY if readonly modifier found, update
comments.


[PATCH, OpenACC 2.7] Adjust acc_map_data/acc_unmap_data interaction with reference counters

2023-06-22 Thread Chung-Lin Tang via Gcc-patches
Hi Thomas,
This patch adjusts the implementation of acc_map_data/acc_unmap_data API library
routines to more fit the description in the OpenACC 2.7 specification.

Instead of using REFCOUNT_INFINITY, we now define a REFCOUNT_ACC_MAP_DATA
special value to mark acc_map_data-created mappings, and allow adjustment of
dynamic_refcount of such mappings by other constructs. Enforcing of an initial
value of 1 for such mappings, and only allowing acc_unmap_data to delete such
mappings, is implemented as specified.

Actually, there is no real change (or improvement) in behavior of the API (thus
no new tests) I've looked at the related OpenACC spec issues, and it seems that
this part of the 2.7 spec change is mostly a clarification (see no downside in
current REFCOUNT_INFINITY based implementation either).
But this patch does make the internals more close to the spec description.

Tested without regressions using powerpc64le-linux/nvptx, okay for trunk?

Thanks,
Chung-Lin

2023-06-22  Chung-Lin Tang  

libgomp/ChangeLog:

* libgomp.h (REFCOUNT_ACC_MAP_DATA): Define as (REFCOUNT_SPECIAL | 2).
* oacc-mem.c (acc_map_data): Adjust to use REFCOUNT_ACC_MAP_DATA,
initialize dynamic_refcount as 1.
(acc_unmap_data): Adjust to use REFCOUNT_ACC_MAP_DATA,
(goacc_map_var_existing): Add REFCOUNT_ACC_MAP_DATA case.
(goacc_exit_datum_1): Add REFCOUNT_ACC_MAP_DATA case, respect
REFCOUNT_ACC_MAP_DATA when decrementing/finalizing. Force lowest
dynamic_refcount to be 1 for REFCOUNT_ACC_MAP_DATA.
* target.c (gomp_increment_refcount): Add REFCOUNT_ACC_MAP_DATA case.
(gomp_decrement_refcount): Add REFCOUNT_ACC_MAP_DATA case, force lowest
dynamic_refcount to be 1 for REFCOUNT_ACC_MAP_DATA.
* testsuite/libgomp.oacc-c-c++-common/unmap-infinity-1.c: Adjust
testcase error output scan test.
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 4d2bfab4b71..fb8ef651dfb 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1166,6 +1166,8 @@ struct target_mem_desc;
 /* Special value for refcount - tgt_offset contains target address of the
artificial pointer to "omp declare target link" object.  */
 #define REFCOUNT_LINK (REFCOUNT_SPECIAL | 1)
+/* Special value for refcount - created through acc_map_data.  */
+#define REFCOUNT_ACC_MAP_DATA (REFCOUNT_SPECIAL | 2)
 
 /* Special value for refcount - structure element sibling list items.
All such key refounts have REFCOUNT_STRUCTELEM bits set, with _FLAG_FIRST
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index fe632740769..2a782ac22c1 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -411,7 +411,8 @@ acc_map_data (void *h, void *d, size_t s)
   assert (n->refcount == 1);
   assert (n->dynamic_refcount == 0);
   /* Special reference counting behavior.  */
-  n->refcount = REFCOUNT_INFINITY;
+  n->refcount = REFCOUNT_ACC_MAP_DATA;
+  n->dynamic_refcount = 1;
 
   if (profiling_p)
{
@@ -460,7 +461,7 @@ acc_unmap_data (void *h)
  the different 'REFCOUNT_INFINITY' cases, or simply separate
  'REFCOUNT_INFINITY' values per different usage ('REFCOUNT_ACC_MAP_DATA'
  etc.)?  */
-  else if (n->refcount != REFCOUNT_INFINITY)
+  else if (n->refcount != REFCOUNT_ACC_MAP_DATA)
 {
   gomp_mutex_unlock (&acc_dev->lock);
   gomp_fatal ("refusing to unmap block [%p,+%d] that has not been mapped"
@@ -519,7 +520,8 @@ goacc_map_var_existing (struct gomp_device_descr *acc_dev, 
void *hostaddr,
 }
 
   assert (n->refcount != REFCOUNT_LINK);
-  if (n->refcount != REFCOUNT_INFINITY)
+  if (n->refcount != REFCOUNT_INFINITY
+  && n->refcount != REFCOUNT_ACC_MAP_DATA)
 n->refcount++;
   n->dynamic_refcount++;
 
@@ -683,6 +685,7 @@ goacc_exit_datum_1 (struct gomp_device_descr *acc_dev, void 
*h, size_t s,
 
   assert (n->refcount != REFCOUNT_LINK);
   if (n->refcount != REFCOUNT_INFINITY
+  && n->refcount != REFCOUNT_ACC_MAP_DATA
   && n->refcount < n->dynamic_refcount)
 {
   gomp_mutex_unlock (&acc_dev->lock);
@@ -691,15 +694,27 @@ goacc_exit_datum_1 (struct gomp_device_descr *acc_dev, 
void *h, size_t s,
 
   if (finalize)
 {
-  if (n->refcount != REFCOUNT_INFINITY)
+  if (n->refcount != REFCOUNT_INFINITY
+ && n->refcount != REFCOUNT_ACC_MAP_DATA)
n->refcount -= n->dynamic_refcount;
-  n->dynamic_refcount = 0;
+
+  if (n->refcount == REFCOUNT_ACC_MAP_DATA)
+   /* Mappings created by acc_map_data are returned to initial
+  dynamic_refcount of 1. Can only be deleted by acc_unmap_data.  */
+   n->dynamic_refcount = 1;
+  else
+   n->dynamic_refcount = 0;
 }
   else if (n->dynamic_refcount)
 {
-  if (n->refcount != REFCOUNT_INFINITY)
+  if (n->refcount != REFCOUNT_INFINITY
+ && n->refcount != REFCOUNT_ACC_MAP_DATA)
n->refcount--;
-  n->dynamic_refcount--;
+
+  /* When mapping is created by acc_map_data, dynam

[PATCH, OpenACC 2.7] readonly modifier support in front-ends

2023-07-10 Thread Chung-Lin Tang via Gcc-patches
Hi Thomas,
this patch contains support for the 'readonly' modifier in copyin clauses
and the cache directive.

As we discussed earlier, the work for actually linking this to middle-end
points-to analysis is a somewhat non-trivial issue. This first patch allows
the language feature to be used in OpenACC directives first (with no effect for 
now).
The middle-end changes are probably going to be a later patch.

(Also CCing Tobias because of the Fortran bits)

Tested on powerpc64le-linux with nvptx offloading. Is this okay for trunk?

Thanks,
Chung-Lin

2023-07-10  Chung-Lin Tang  

gcc/c/ChangeLog:
* c-parser.cc (c_parser_omp_var_list_parens):
Add 'bool *readonly = NULL' parameter, add readonly modifier parsing
support.
(c_parser_oacc_data_clause): Adjust c_parser_omp_var_list_parens call
to turn on readonly modifier parsing for copyin clause, set
OMP_CLAUSE_MAP_READONLY if readonly modifier found, update comments.
(c_parser_oacc_cache): Adjust c_parser_omp_var_list_parens call
to turn on readonly modifier parsing, set OMP_CLAUSE__CACHE__READONLY
if readonly modifier found, update comments.

gcc/cp/ChangeLog:
* parser.cc (cp_parser_omp_var_list):
Add 'bool *readonly = NULL' parameter, add readonly modifier parsing
support.
(cp_parser_oacc_data_clause): Adjust cp_parser_omp_var_list call
to turn on readonly modifier parsing for copyin clause, set
OMP_CLAUSE_MAP_READONLY if readonly modifier found, update comments.
(cp_parser_oacc_cache): Adjust cp_parser_omp_var_list call
to turn on readonly modifier parsing, set OMP_CLAUSE__CACHE__READONLY
if readonly modifier found, update comments.

gcc/fortran/ChangeLog:
* gfortran.h (typedef struct gfc_omp_namelist): Adjust map_op as
ENUM_BITFIELD field, add 'bool readonly' field.
* openmp.cc (gfc_match_omp_map_clause): Add 'bool readonly = false'
parameter, set n->u.readonly field.
(gfc_match_omp_clauses): Add readonly modifier parsing for OpenACC
copyin clause, adjust call to gfc_match_omp_map_clause.
(gfc_match_oacc_cache): Add readonly modifier parsing for OpenACC
cache directive, adjust call to gfc_match_omp_map_clause.
* trans-openmp.cc (gfc_trans_omp_clauses): Set OMP_CLAUSE_MAP_READONLY,
OMP_CLAUSE__CACHE__READONLY to 1 when readonly is set.

gcc/ChangeLog:
* tree-pretty-print.cc (dump_omp_clause): Add support for printing
OMP_CLAUSE_MAP_READONLY and OMP_CLAUSE__CACHE__READONLY.
* tree.h (OMP_CLAUSE_MAP_READONLY): New macro.
(OMP_CLAUSE__CACHE__READONLY): New macro.

gcc/testsuite/ChangeLog:
* c-c++-common/goacc/readonly-1.c: New test.
* gfortran.dg/goacc/readonly-1.f90: New test.

diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index d4b98d5d8b6..09e1e89d793 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -14059,7 +14059,8 @@ c_parser_omp_variable_list (c_parser *parser,
 
 static tree
 c_parser_omp_var_list_parens (c_parser *parser, enum omp_clause_code kind,
- tree list, bool allow_deref = false)
+ tree list, bool allow_deref = false,
+ bool *readonly = NULL)
 {
   /* The clauses location.  */
   location_t loc = c_parser_peek_token (parser)->location;
@@ -14067,6 +14068,20 @@ c_parser_omp_var_list_parens (c_parser *parser, enum 
omp_clause_code kind,
   matching_parens parens;
   if (parens.require_open (parser))
 {
+  if (readonly != NULL)
+   {
+ c_token *token = c_parser_peek_token (parser);
+ if (token->type == CPP_NAME
+ && !strcmp (IDENTIFIER_POINTER (token->value), "readonly")
+ && c_parser_peek_2nd_token (parser)->type == CPP_COLON)
+   {
+ c_parser_consume_token (parser);
+ c_parser_consume_token (parser);
+ *readonly = true;
+   }
+ else
+   *readonly = false;
+   }
   list = c_parser_omp_variable_list (parser, loc, kind, list, allow_deref);
   parens.skip_until_found_close (parser);
 }
@@ -14084,7 +14099,11 @@ c_parser_omp_var_list_parens (c_parser *parser, enum 
omp_clause_code kind,
OpenACC 2.6:
no_create ( variable-list )
attach ( variable-list )
-   detach ( variable-list ) */
+   detach ( variable-list )
+
+   OpenACC 2.7:
+   copyin (readonly : variable-list )
+ */
 
 static tree
 c_parser_oacc_data_clause (c_parser *parser, pragma_omp_clause c_kind,
@@ -14135,11 +14154,22 @@ c_parser_oacc_data_clause (c_parser *parser, 
pragma_omp_clause c_kind,
 default:
   gcc_unreachable ();
 }
+
+  /* Turn on readonly modifier parsing for copyin clause.  */
+  bool readonly = false, *readonly_ptr = NULL;
+  if (c_kind == PRAGMA_OACC_CLAUSE_COPYIN)
+readonly_ptr = &readonly;
+
   tree nl, c;
-  nl = c_parser_omp_var_list_parens (pars

[PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx

2022-09-21 Thread Chung-Lin Tang via Gcc-patches

Hi Tom,
I had a patch submitted earlier, where I reported that the current way of 
implementing
barriers in libgomp on nvptx created a quite significant performance drop on 
some SPEChpc2021
benchmarks:
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html

That previous patch wasn't accepted well (admittedly, it was kind of a hack).
So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.

Basically, instead of trying to have the GPU do CPU-with-OS-like things that it 
isn't suited for,
barriers are implemented simplistically with bar.* synchronization instructions.
Tasks are processed after threads have joined, and only if team->task_count != 0

(arguably, there might be a little bit of performance forfeited where earlier 
arriving threads
could've been used to process tasks ahead of other threads. But that again 
falls into requiring
implementing complex futex-wait/wake like behavior. Really, that kind of 
tasking is not what target
offloading is usually used for)

Implementation highlight notes:
1. gomp_team_barrier_wake() is now an empty function (threads never "wake" in 
the usual manner)
2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
3. gomp_barrier_wait_last() now is implemented using "bar.arrive"

4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
   The main synchronization is done using a 'bar.red' instruction. This reduces 
across all threads
   the condition (team->task_count != 0), to enable the task processing down 
below if any thread
   created a task. (this bar.red usage required the need of the second GCC 
patch in this series)

This patch has been tested on x86_64/powerpc64le with nvptx offloading, using 
libgomp, ovo, omptests,
and sollve_vv testsuites, all without regressions. Also verified that the 
SPEChpc 2021 521.miniswp_t
and 534.hpgmgfv_t performance regressions that occurred in the GCC12 cycle has 
been restored to
devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk?

(also suggest backporting to GCC12 branch, if performance regression can be 
considered a defect)

Thanks,
Chung-Lin

libgomp/ChangeLog:

2022-09-21  Chung-Lin Tang  

* config/nvptx/bar.c (generation_to_barrier): Remove.
(futex_wait,futex_wake,do_spin,do_wait): Remove.
(GOMP_WAIT_H): Remove.
(#include "../linux/bar.c"): Remove.
(gomp_barrier_wait_end): New function.
(gomp_barrier_wait): Likewise.
(gomp_barrier_wait_last): Likewise.
(gomp_team_barrier_wait_end): Likewise.
(gomp_team_barrier_wait): Likewise.
(gomp_team_barrier_wait_final): Likewise.
(gomp_team_barrier_wait_cancel_end): Likewise.
(gomp_team_barrier_wait_cancel): Likewise.
(gomp_team_barrier_cancel): Likewise.
* config/nvptx/bar.h (gomp_team_barrier_wake): Remove
prototype, add new static inline function.
diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c
index eee2107..0b958ed 100644
--- a/libgomp/config/nvptx/bar.c
+++ b/libgomp/config/nvptx/bar.c
@@ -30,137 +30,143 @@
 #include 
 #include "libgomp.h"
 
-/* For cpu_relax.  */
-#include "doacross.h"
-
-/* Assuming ADDR is &bar->generation, return bar.  Copied from
-   rtems/bar.c.  */
+void
+gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
+{
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+{
+  /* Next time we'll be awaiting TOTAL threads again.  */
+  bar->awaited = bar->total;
+  __atomic_store_n (&bar->generation, bar->generation + BAR_INCR,
+   MEMMODEL_RELEASE);
+}
+  if (bar->total > 1)
+asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+}
 
-static gomp_barrier_t *
-generation_to_barrier (int *addr)
+void
+gomp_barrier_wait (gomp_barrier_t *bar)
 {
-  char *bar
-= (char *) addr - __builtin_offsetof (gomp_barrier_t, generation);
-  return (gomp_barrier_t *)bar;
+  gomp_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
 }
 
-/* Implement futex_wait-like behaviour to plug into the linux/bar.c
-   implementation.  Assumes ADDR is &bar->generation.   */
+/* Like gomp_barrier_wait, except that if the encountering thread
+   is not the last one to hit the barrier, it returns immediately.
+   The intended usage is that a thread which intends to gomp_barrier_destroy
+   this barrier calls gomp_barrier_wait, while all other threads
+   call gomp_barrier_wait_last.  When gomp_barrier_wait returns,
+   the barrier can be safely destroyed.  */
 
-static inline void
-futex_wait (int *addr, int val)
+void
+gomp_barrier_wait_last (gomp_barrier_t *bar)
 {
-  gomp_barrier_t *bar = generation_to_barrier (addr);
+  /* The above described behavior matches 'bar.arrive' perfectly.  */
+  if (bar->total > 1)
+asm ("bar.arrive 1, %0;" : : "r" (32 * bar->total));
+}
 
-  if (bar->total < 2)
-/* A barrier with less than two threads, nop.  */
-return;
+void
+gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_

[PATCH, nvptx, 2/2] Reimplement libgomp barriers for nvptx: bar.red instruction support in GCC

2022-09-21 Thread Chung-Lin Tang via Gcc-patches

Hi Tom, following the first patch.

This new barrier implementation I posted in the first patch uses the 'bar.red' 
instruction.
Usually this could've been easily done with a single line of inline assembly. 
However I quickly
realized that because the NVPTX GCC port is implemented with all virtual 
general registers,
we don't have a register constraint usable to select "predicate registers".
Since bar.red uses predicate typed values, I can't create it directly using 
inline asm.

So it appears that the most simple way of accessing it is with a target builtin.
The attached patch adds bar.red instructions to the nvptx port, and 
__builtin_nvptx_bar_red_* builtins
to use it. The code should support all variations of bar.red (and, or, and popc 
operations).

(This support was used to implement the first libgomp barrier patch, so must be 
approved together)

Thanks,
Chung-Lin

2022-09-21  Chung-Lin Tang  

gcc/ChangeLog:

* config/nvptx/nvptx.cc (nvptx_print_operand): Add 'p'
case, adjust comments.
(enum nvptx_builtins): Add NVPTX_BUILTIN_BAR_RED_AND,
NVPTX_BUILTIN_BAR_RED_OR, and NVPTX_BUILTIN_BAR_RED_POPC.
(nvptx_expand_bar_red): New function.
(nvptx_init_builtins):
Add DEFs of __builtin_nvptx_bar_red_[and/or/popc].
(nvptx_expand_builtin): Use nvptx_expand_bar_red to expand
NVPTX_BUILTIN_BAR_RED_[AND/OR/POPC] cases.

* config/nvptx/nvptx.md (define_c_enum "unspecv"): Add
UNSPECV_BARRED_AND, UNSPECV_BARRED_OR, and UNSPECV_BARRED_POPC.
(BARRED): New int iterator.
(barred_op,barred_mode,barred_ptxtype): New int attrs.
(nvptx_barred_): New define_insn.
diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
index 49cc681..afc3a890 100644
--- a/gcc/config/nvptx/nvptx.cc
+++ b/gcc/config/nvptx/nvptx.cc
@@ -2879,6 +2879,7 @@ nvptx_mem_maybe_shared_p (const_rtx x)
t -- print a type opcode suffix, promoting QImode to 32 bits
T -- print a type size in bits
u -- print a type opcode suffix without promotions.
+   p -- print a '!' for constant 0.
x -- print a destination operand that may also be a bit bucket.  */
 
 static void
@@ -3012,6 +3013,11 @@ nvptx_print_operand (FILE *file, rtx x, int code)
   fprintf (file, "@!");
   goto common;
 
+case 'p':
+  if (INTVAL (x) == 0)
+   fprintf (file, "!");
+  break;
+
 case 'c':
   mode = GET_MODE (XEXP (x, 0));
   switch (x_code)
@@ -6151,9 +6157,90 @@ enum nvptx_builtins
   NVPTX_BUILTIN_CMP_SWAPLL,
   NVPTX_BUILTIN_MEMBAR_GL,
   NVPTX_BUILTIN_MEMBAR_CTA,
+  NVPTX_BUILTIN_BAR_RED_AND,
+  NVPTX_BUILTIN_BAR_RED_OR,
+  NVPTX_BUILTIN_BAR_RED_POPC,
   NVPTX_BUILTIN_MAX
 };
 
+/* Expander for 'bar.red' instruction builtins.  */
+
+static rtx
+nvptx_expand_bar_red (tree exp, rtx target,
+ machine_mode ARG_UNUSED (m), int ARG_UNUSED (ignore))
+{
+  int code = DECL_MD_FUNCTION_CODE (TREE_OPERAND (CALL_EXPR_FN (exp), 0));
+  machine_mode mode = TYPE_MODE (TREE_TYPE (exp));
+
+  if (!target)
+target = gen_reg_rtx (mode);
+
+  rtx pred, dst;
+  rtx bar = expand_expr (CALL_EXPR_ARG (exp, 0),
+NULL_RTX, SImode, EXPAND_NORMAL);
+  rtx nthr = expand_expr (CALL_EXPR_ARG (exp, 1),
+ NULL_RTX, SImode, EXPAND_NORMAL);
+  rtx cpl = expand_expr (CALL_EXPR_ARG (exp, 2),
+NULL_RTX, SImode, EXPAND_NORMAL);
+  rtx redop = expand_expr (CALL_EXPR_ARG (exp, 3),
+  NULL_RTX, SImode, EXPAND_NORMAL);
+  if (CONST_INT_P (bar))
+{
+  if (INTVAL (bar) < 0 || INTVAL (bar) > 15)
+   {
+ error_at (EXPR_LOCATION (exp),
+   "barrier value must be within [0,15]");
+ return const0_rtx;
+   }
+}
+  else if (!REG_P (bar))
+bar = copy_to_mode_reg (SImode, bar);
+
+  if (!CONST_INT_P (nthr) && !REG_P (nthr))
+nthr = copy_to_mode_reg (SImode, nthr);
+
+  if (!CONST_INT_P (cpl))
+{
+  error_at (EXPR_LOCATION (exp),
+   "complement argument must be constant");
+  return const0_rtx;
+}
+
+  pred = gen_reg_rtx (BImode);
+  if (!REG_P (redop))
+redop = copy_to_mode_reg (SImode, redop);
+  emit_insn (gen_rtx_SET (pred, gen_rtx_NE (BImode, redop, GEN_INT (0;
+  redop = pred;
+
+  rtx pat;
+  switch (code)
+{
+case NVPTX_BUILTIN_BAR_RED_AND:
+  dst = gen_reg_rtx (BImode);
+  pat = gen_nvptx_barred_and (dst, bar, nthr, cpl, redop);
+  break;
+case NVPTX_BUILTIN_BAR_RED_OR:
+  dst = gen_reg_rtx (BImode);
+  pat = gen_nvptx_barred_or (dst, bar, nthr, cpl, redop);
+  break;
+case NVPTX_BUILTIN_BAR_RED_POPC:
+  dst = gen_reg_rtx (SImode);
+  pat = gen_nvptx_barred_popc (dst, bar, nthr, cpl, redop);
+  break;
+default:
+  gcc_unreachable ();
+}
+  emit_insn (pat);
+  if (GET_MODE (dst) == BImode)
+{
+  rtx tmp = gen_reg_rtx (mode);
+  emit_insn (gen_rtx_SET (tmp, gen_rtx_NE (m

Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx

2022-09-21 Thread Chung-Lin Tang via Gcc-patches




On 2022/9/21 5:01 PM, Jakub Jelinek wrote:

On Wed, Sep 21, 2022 at 03:45:36PM +0800, Chung-Lin Tang via Gcc-patches wrote:

Hi Tom,
I had a patch submitted earlier, where I reported that the current way of 
implementing
barriers in libgomp on nvptx created a quite significant performance drop on 
some SPEChpc2021
benchmarks:
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html

That previous patch wasn't accepted well (admittedly, it was kind of a hack).
So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.

Basically, instead of trying to have the GPU do CPU-with-OS-like things that it 
isn't suited for,
barriers are implemented simplistically with bar.* synchronization instructions.
Tasks are processed after threads have joined, and only if team->task_count != 0

(arguably, there might be a little bit of performance forfeited where earlier 
arriving threads
could've been used to process tasks ahead of other threads. But that again 
falls into requiring
implementing complex futex-wait/wake like behavior. Really, that kind of 
tasking is not what target
offloading is usually used for)


I admit I don't have a good picture if people in real-world actually use
tasking in offloading regions and how much and in what way, but the above
definitely would be a show-stopper for typical tasking workloads, where
one thread (usually from master/masked/single construct's body) creates lots
of tasks and can spend considerable amount of time in those preparations,
while other threads are expected to handle those tasks.


I think the most common use case for target offloading is "parallel for".

Really, not simply removing tasking altogether from target regions in the 
specification is just looking for trouble.

If asynchronous offloaded tasks are to be supported, something at the whole GPU 
offload region level
is much more reasonable, like the async clause functionality in OpenACC.


Do we have an idea how are other implementations handling this?
I think it should be easily observable with atomics, have
master/masked/single that creates lots of tasks and then spends a long time
doing something, have very small task bodies that just increment some atomic
counter and at the end of the master/masked/single see how many tasks were
already encountered.


This could be an interesting test...


Note, I don't have any smart ideas how to handle this instead and what
you posted might be ok for what people usually do on offloading targets
in OpenMP if they use tasking at all, just wanted to mention that there
could be workloads where the above is a serious problem.  If there are
say hundreds of threads doing nothing until a single thread reaches a
barrier and there are hundreds of pending tasks...


I think it might still be doable, just not in the very fine "wake one thread" 
style
that the Linux-based implementation was doing.


E.g. note we have that 64 pending task limit after which we start to
create undeferred tasks, so if we never start handling tasks until
one thread is done with them, that would mean the single thread
would create 64 deferred tasks and then handle all the others itself
making it even longer until the other tasks can deal with it.


Okay, thanks for reminding that.

Chung-Lin


[PATCH, OG10, OpenMP 5.0, committed] Implement relaxation of implicit map vs. existing device mappings

2021-05-05 Thread Chung-Lin Tang via Gcc-patches

This patch implements relaxing the requirements when a map with the implicit 
attribute encounters
an overlapping existing map. As the OpenMP 5.0 spec describes on page 320, 
lines 18-27 (and 5.1 spec,
page 352, lines 13-22):

"If a single contiguous part of the original storage of a list item with an 
implicit data-mapping
 attribute has corresponding storage in the device data environment prior to a 
task encountering the
 construct that is associated with the map clause, only that part of the 
original storage will have
 corresponding storage in the device data environment as a result of the map 
clause."

Also tracked in the OpenMP spec context as issue #1463:
https://github.com/OpenMP/spec/issues/1463

The implementation inside the compiler is to of course, tag the implicitly 
created maps with some
indication of "implicit". I've done this with a OMP_CLAUSE_MAP_IMPLICIT_P 
macro, using
'base.deprecated_flag' underneath.

There is an encoding of this as GOMP_MAP_IMPLICIT == 
GOMP_MAP_FLAG_SPECIAL_3|GOMP_MAP_FLAG_SPECIAL_4
in include/gomp-constants.h for the runtime, but I've intentionally avoided 
exploding the entire
gimplify/omp-low with a new set of GOMP_MAP_IMPLICIT_TO/FROM/etc. symbols, 
instead adding in the new
flag bits only at the final runtime call generation during omp-lowering.

The rest is libgomp mapping taking care of the implicit case: allowing map 
success if an existing
map is a proper subset of the new map, if the new map is implicit. 
Straightforward enough I think.

There are also some additions to print the implicit attribute during tree 
pretty-printing, for that
reason some scan tests were updated.

Also, another adjustment in this patch is how implicitly created clauses are 
added to the current
clause list in gimplify_adjust_omp_clauses(). Instead of simply appending the 
new clauses to the end,
this patch adds them at the position "after initial non-map clauses, but right 
before any existing
map clauses".

The reason for this is: when combined with other map clauses, for example:

  #pragma omp target map(rec.ptr[:N])
  for (int i = 0; i < N; i++)
rec.ptr[i] += 1;

There will be an implicit map created for map(rec), because of the access 
inside the target region.
The expectation is that 'rec' is implicitly mapped, and then the pointed 
array-section part by 'rec.ptr'
will be mapped, and then attachment to the 'rec.ptr' field of the mapped 'rec' 
(in that order).

If the implicit 'map(rec)' is appended to the end, instead of placed before 
other maps, the attachment
operation will not find anything to attach to, and the entire region will fail.

Note: this touches a bit on another issue which I will be sending a patch for 
later:
per the discussion on omp-lang, an array section list item should *not* be 
mapping its base-pointer
(although an attachment attempt should exist), while in current GCC behavior, 
for struct member pointers
like 'rec.ptr' above, we do map it (which should be deemed incorrect).

This means that as of right now, this modification of map order doesn't really 
exhibit the above mentioned
behavior yet. I have included it as part of this patch because the "[implicit]" 
tree printing requires
modifying many gimple scan tests already, so including the test modifications 
together seems more
manageable patch-wise.

Tested with no regressions, and pushed to devel/omp/gcc-10. Will be submitting 
a mainline trunk version later.

Chung-Lin

2021-05-05  Chung-Lin Tang  

include/ChangeLog:

* gomp-constants.h (GOMP_MAP_IMPLICIT): New special map kind bits value.
(GOMP_MAP_FLAG_SPECIAL_BITS): Define helper mask for whole set of
special map kind bits.
(GOMP_MAP_NONCONTIG_ARRAY_P): Adjust test for non-contiguous array map
kind bits to be more specific.
(GOMP_MAP_IMPLICIT_P): New predicate macro for implicit map kinds.

gcc/ChangeLog:

* tree.h (OMP_CLAUSE_MAP_IMPLICIT_P): New access macro for 'implicit'
bit, using 'base.deprecated_flag' field of tree_node.
* tree-pretty-print.c (dump_omp_clause): Add support for printing
implicit attribute in tree dumping.
* gimplify.c (gimplify_adjust_omp_clauses_1):
Set OMP_CLAUSE_MAP_IMPLICIT_P to 1 if map clause is implicitly created.
(gimplify_adjust_omp_clauses): Adjust place of adding implicitly created
clauses, from simple append, to starting of list, after non-map clauses.
* omp-low.c (lower_omp_target): Add GOMP_MAP_IMPLICIT bits into kind
values passed to libgomp for implicit maps.

gcc/testsuite/ChangeLog:

* c-c++-common/gomp/target-implicit-map-1.c: New test.
* c-c++-common/goacc/combined-reduction.c: Adjust scan test pattern.
* c-c++-common/goacc/firstprivate-mappings-1.c: Likewise.
* c-c++-common/goacc/mdc-1.c: Likewise.
* c-c++-common/goacc/reduction-1.c: Likewise.
* c-c++-common/goacc/reduction-2.c: Likewise.
* c-c++-common/goacc

[Ping x6] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx

2022-12-12 Thread Chung-Lin Tang via Gcc-patches
Ping x6

On 2022/12/6 12:21 AM, Chung-Lin Tang wrote:
> Ping x5
> 
> On 2022/11/22 12:24 上午, Chung-Lin Tang wrote:
>> Ping x4
>>
>> On 2022/11/8 12:34 AM, Chung-Lin Tang wrote:
>>> Ping x3.
>>>
>>> On 2022/10/31 10:18 PM, Chung-Lin Tang wrote:
>>>> Ping x2.
>>>>
>>>> On 2022/10/17 10:29 PM, Chung-Lin Tang wrote:
>>>>> Ping.
>>>>>
>>>>> On 2022/9/21 3:45 PM, Chung-Lin Tang via Gcc-patches wrote:
>>>>>> Hi Tom,
>>>>>> I had a patch submitted earlier, where I reported that the current way 
>>>>>> of implementing
>>>>>> barriers in libgomp on nvptx created a quite significant performance 
>>>>>> drop on some SPEChpc2021
>>>>>> benchmarks:
>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html 
>>>>>> That previous patch wasn't accepted well (admittedly, it was kind of a 
>>>>>> hack).
>>>>>> So in this patch, I tried to (mostly) re-implement team-barriers for 
>>>>>> NVPTX.
>>>>>>
>>>>>> Basically, instead of trying to have the GPU do CPU-with-OS-like things 
>>>>>> that it isn't suited for,
>>>>>> barriers are implemented simplistically with bar.* synchronization 
>>>>>> instructions.
>>>>>> Tasks are processed after threads have joined, and only if 
>>>>>> team->task_count != 0
>>>>>>
>>>>>> (arguably, there might be a little bit of performance forfeited where 
>>>>>> earlier arriving threads
>>>>>> could've been used to process tasks ahead of other threads. But that 
>>>>>> again falls into requiring
>>>>>> implementing complex futex-wait/wake like behavior. Really, that kind of 
>>>>>> tasking is not what target
>>>>>> offloading is usually used for)
>>>>>>
>>>>>> Implementation highlight notes:
>>>>>> 1. gomp_team_barrier_wake() is now an empty function (threads never 
>>>>>> "wake" in the usual manner)
>>>>>> 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
>>>>>> 3. gomp_barrier_wait_last() now is implemented using "bar.arrive"
>>>>>>
>>>>>> 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
>>>>>> The main synchronization is done using a 'bar.red' instruction. This 
>>>>>> reduces across all threads
>>>>>> the condition (team->task_count != 0), to enable the task processing 
>>>>>> down below if any thread
>>>>>> created a task. (this bar.red usage required the need of the second 
>>>>>> GCC patch in this series)
>>>>>>
>>>>>> This patch has been tested on x86_64/powerpc64le with nvptx offloading, 
>>>>>> using libgomp, ovo, omptests,
>>>>>> and sollve_vv testsuites, all without regressions. Also verified that 
>>>>>> the SPEChpc 2021 521.miniswp_t
>>>>>> and 534.hpgmgfv_t performance regressions that occurred in the GCC12 
>>>>>> cycle has been restored to
>>>>>> devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk?
>>>>>>
>>>>>> (also suggest backporting to GCC12 branch, if performance regression can 
>>>>>> be considered a defect)
>>>>>>
>>>>>> Thanks,
>>>>>> Chung-Lin
>>>>>>
>>>>>> libgomp/ChangeLog:
>>>>>>
>>>>>> 2022-09-21  Chung-Lin Tang  
>>>>>>
>>>>>>  * config/nvptx/bar.c (generation_to_barrier): Remove.
>>>>>>  (futex_wait,futex_wake,do_spin,do_wait): Remove.
>>>>>>  (GOMP_WAIT_H): Remove.
>>>>>>  (#include "../linux/bar.c"): Remove.
>>>>>>  (gomp_barrier_wait_end): New function.
>>>>>>  (gomp_barrier_wait): Likewise.
>>>>>>  (gomp_barrier_wait_last): Likewise.
>>>>>>  (gomp_team_barrier_wait_end): Likewise.
>>>>>>  (gomp_team_barrier_wait): Likewise.
>>>>>>  (gomp_team_barrier_wait_final): Likewise.
>>>>>>  (gomp_team_barrier_wait_cancel_end): Likewise.
>>>>>>  (gomp_team_barrier_wait_cancel): Likewise.
>>>>>>  (gomp_team_barrier_cancel): Likewise.
>>>>>>  * config/nvptx/bar.h (gomp_team_barrier_wake): Remove
>>>>>>  prototype, add new static inline function.
>>>
>>
> 



Re: nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case (was: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes)

2023-01-13 Thread Chung-Lin Tang via Gcc-patches
Hi Thomas,

On 2023/1/12 9:51 PM, Thomas Schwinge wrote:
> In my case, 'cuda_callback_wrapper' (expectedly) gets invoked with
> 'res != CUDA_SUCCESS' ("an illegal memory access was encountered").
> When we invoke 'GOMP_PLUGIN_fatal', this attempts to shut down the device
> (..., which deadlocks); that's generally problematic: per
> https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483
> "'cuStreamAddCallback' [...] Callbacks must not make any CUDA API calls".

I remember running into this myself when first creating this async support
(IIRC in my case it was cuFree()-ing something) yet you've found another 
mistake here! :) 

> Given that eventually we must reach a host/device synchronization point
> (latest when the device is shut down at program termination), and the
> non-'CUDA_SUCCESS' will be upheld until then, it does seem safe to
> replace this 'GOMP_PLUGIN_fatal' with 'GOMP_PLUGIN_error' as per the
> "nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case"
> attached.  OK to push?

I think this patch is fine. Actual approval powers are your's or Tom's :)

> 
> (Might we even skip 'GOMP_PLUGIN_error' here, understanding that the
> error will be caught and reported at the next host/device synchronization
> point?  But I've not verified that.)

Actually, the CUDA driver API docs are a bit vague on what exactly this
CUresult arg to the callback actually means. The 'res != CUDA_SUCCESS' handling
here was basically just generic handling. I am not really sure what is the
true right thing to do here (is the error still retained by CUDA after the 
callback
completes?)

Chung-Lin



[Ping x3] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx

2022-11-07 Thread Chung-Lin Tang via Gcc-patches
Ping x3.

On 2022/10/31 10:18 PM, Chung-Lin Tang wrote:
> Ping x2.
> 
> On 2022/10/17 10:29 PM, Chung-Lin Tang wrote:
>> Ping.
>>
>> On 2022/9/21 3:45 PM, Chung-Lin Tang via Gcc-patches wrote:
>>> Hi Tom,
>>> I had a patch submitted earlier, where I reported that the current way of 
>>> implementing
>>> barriers in libgomp on nvptx created a quite significant performance drop 
>>> on some SPEChpc2021
>>> benchmarks:
>>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
>>>
>>> That previous patch wasn't accepted well (admittedly, it was kind of a 
>>> hack).
>>> So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.
>>>
>>> Basically, instead of trying to have the GPU do CPU-with-OS-like things 
>>> that it isn't suited for,
>>> barriers are implemented simplistically with bar.* synchronization 
>>> instructions.
>>> Tasks are processed after threads have joined, and only if team->task_count 
>>> != 0
>>>
>>> (arguably, there might be a little bit of performance forfeited where 
>>> earlier arriving threads
>>> could've been used to process tasks ahead of other threads. But that again 
>>> falls into requiring
>>> implementing complex futex-wait/wake like behavior. Really, that kind of 
>>> tasking is not what target
>>> offloading is usually used for)
>>>
>>> Implementation highlight notes:
>>> 1. gomp_team_barrier_wake() is now an empty function (threads never "wake" 
>>> in the usual manner)
>>> 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
>>> 3. gomp_barrier_wait_last() now is implemented using "bar.arrive"
>>>
>>> 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
>>> The main synchronization is done using a 'bar.red' instruction. This 
>>> reduces across all threads
>>> the condition (team->task_count != 0), to enable the task processing 
>>> down below if any thread
>>> created a task. (this bar.red usage required the need of the second GCC 
>>> patch in this series)
>>>
>>> This patch has been tested on x86_64/powerpc64le with nvptx offloading, 
>>> using libgomp, ovo, omptests,
>>> and sollve_vv testsuites, all without regressions. Also verified that the 
>>> SPEChpc 2021 521.miniswp_t
>>> and 534.hpgmgfv_t performance regressions that occurred in the GCC12 cycle 
>>> has been restored to
>>> devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk?
>>>
>>> (also suggest backporting to GCC12 branch, if performance regression can be 
>>> considered a defect)
>>>
>>> Thanks,
>>> Chung-Lin
>>>
>>> libgomp/ChangeLog:
>>>
>>> 2022-09-21  Chung-Lin Tang  
>>>
>>> * config/nvptx/bar.c (generation_to_barrier): Remove.
>>> (futex_wait,futex_wake,do_spin,do_wait): Remove.
>>> (GOMP_WAIT_H): Remove.
>>> (#include "../linux/bar.c"): Remove.
>>> (gomp_barrier_wait_end): New function.
>>> (gomp_barrier_wait): Likewise.
>>> (gomp_barrier_wait_last): Likewise.
>>> (gomp_team_barrier_wait_end): Likewise.
>>> (gomp_team_barrier_wait): Likewise.
>>> (gomp_team_barrier_wait_final): Likewise.
>>> (gomp_team_barrier_wait_cancel_end): Likewise.
>>> (gomp_team_barrier_wait_cancel): Likewise.
>>> (gomp_team_barrier_cancel): Likewise.
>>> * config/nvptx/bar.h (gomp_team_barrier_wake): Remove
>>> prototype, add new static inline function.



[Ping x4] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx

2022-11-21 Thread Chung-Lin Tang via Gcc-patches
Ping x4

On 2022/11/8 12:34 AM, Chung-Lin Tang wrote:
> Ping x3.
> 
> On 2022/10/31 10:18 PM, Chung-Lin Tang wrote:
>> Ping x2.
>>
>> On 2022/10/17 10:29 PM, Chung-Lin Tang wrote:
>>> Ping.
>>>
>>> On 2022/9/21 3:45 PM, Chung-Lin Tang via Gcc-patches wrote:
>>>> Hi Tom,
>>>> I had a patch submitted earlier, where I reported that the current way of 
>>>> implementing
>>>> barriers in libgomp on nvptx created a quite significant performance drop 
>>>> on some SPEChpc2021
>>>> benchmarks:
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html>>>>>
>>>> That previous patch wasn't accepted well (admittedly, it was kind of a 
>>>> hack).
>>>> So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.
>>>>
>>>> Basically, instead of trying to have the GPU do CPU-with-OS-like things 
>>>> that it isn't suited for,
>>>> barriers are implemented simplistically with bar.* synchronization 
>>>> instructions.
>>>> Tasks are processed after threads have joined, and only if 
>>>> team->task_count != 0
>>>>
>>>> (arguably, there might be a little bit of performance forfeited where 
>>>> earlier arriving threads
>>>> could've been used to process tasks ahead of other threads. But that again 
>>>> falls into requiring
>>>> implementing complex futex-wait/wake like behavior. Really, that kind of 
>>>> tasking is not what target
>>>> offloading is usually used for)
>>>>
>>>> Implementation highlight notes:
>>>> 1. gomp_team_barrier_wake() is now an empty function (threads never "wake" 
>>>> in the usual manner)
>>>> 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
>>>> 3. gomp_barrier_wait_last() now is implemented using "bar.arrive"
>>>>
>>>> 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
>>>> The main synchronization is done using a 'bar.red' instruction. This 
>>>> reduces across all threads
>>>> the condition (team->task_count != 0), to enable the task processing 
>>>> down below if any thread
>>>> created a task. (this bar.red usage required the need of the second 
>>>> GCC patch in this series)
>>>>
>>>> This patch has been tested on x86_64/powerpc64le with nvptx offloading, 
>>>> using libgomp, ovo, omptests,
>>>> and sollve_vv testsuites, all without regressions. Also verified that the 
>>>> SPEChpc 2021 521.miniswp_t
>>>> and 534.hpgmgfv_t performance regressions that occurred in the GCC12 cycle 
>>>> has been restored to
>>>> devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk?
>>>>
>>>> (also suggest backporting to GCC12 branch, if performance regression can 
>>>> be considered a defect)
>>>>
>>>> Thanks,
>>>> Chung-Lin
>>>>
>>>> libgomp/ChangeLog:
>>>>
>>>> 2022-09-21  Chung-Lin Tang  
>>>>
>>>>* config/nvptx/bar.c (generation_to_barrier): Remove.
>>>>(futex_wait,futex_wake,do_spin,do_wait): Remove.
>>>>(GOMP_WAIT_H): Remove.
>>>>(#include "../linux/bar.c"): Remove.
>>>>(gomp_barrier_wait_end): New function.
>>>>(gomp_barrier_wait): Likewise.
>>>>(gomp_barrier_wait_last): Likewise.
>>>>(gomp_team_barrier_wait_end): Likewise.
>>>>(gomp_team_barrier_wait): Likewise.
>>>>(gomp_team_barrier_wait_final): Likewise.
>>>>(gomp_team_barrier_wait_cancel_end): Likewise.
>>>>(gomp_team_barrier_wait_cancel): Likewise.
>>>>(gomp_team_barrier_cancel): Likewise.
>>>>* config/nvptx/bar.h (gomp_team_barrier_wake): Remove
>>>>prototype, add new static inline function.
> 



[Ping x5] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx

2022-12-05 Thread Chung-Lin Tang via Gcc-patches
Ping x5

On 2022/11/22 12:24 上午, Chung-Lin Tang wrote:
> Ping x4
> 
> On 2022/11/8 12:34 AM, Chung-Lin Tang wrote:
>> Ping x3.
>>
>> On 2022/10/31 10:18 PM, Chung-Lin Tang wrote:
>>> Ping x2.
>>>
>>> On 2022/10/17 10:29 PM, Chung-Lin Tang wrote:
>>>> Ping.
>>>>
>>>> On 2022/9/21 3:45 PM, Chung-Lin Tang via Gcc-patches wrote:
>>>>> Hi Tom,
>>>>> I had a patch submitted earlier, where I reported that the current way of 
>>>>> implementing
>>>>> barriers in libgomp on nvptx created a quite significant performance drop 
>>>>> on some SPEChpc2021
>>>>> benchmarks:
>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html>>>>>>
>>>>>  That previous patch wasn't accepted well (admittedly, it was kind of a 
>>>>> hack).
>>>>> So in this patch, I tried to (mostly) re-implement team-barriers for 
>>>>> NVPTX.
>>>>>
>>>>> Basically, instead of trying to have the GPU do CPU-with-OS-like things 
>>>>> that it isn't suited for,
>>>>> barriers are implemented simplistically with bar.* synchronization 
>>>>> instructions.
>>>>> Tasks are processed after threads have joined, and only if 
>>>>> team->task_count != 0
>>>>>
>>>>> (arguably, there might be a little bit of performance forfeited where 
>>>>> earlier arriving threads
>>>>> could've been used to process tasks ahead of other threads. But that 
>>>>> again falls into requiring
>>>>> implementing complex futex-wait/wake like behavior. Really, that kind of 
>>>>> tasking is not what target
>>>>> offloading is usually used for)
>>>>>
>>>>> Implementation highlight notes:
>>>>> 1. gomp_team_barrier_wake() is now an empty function (threads never 
>>>>> "wake" in the usual manner)
>>>>> 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
>>>>> 3. gomp_barrier_wait_last() now is implemented using "bar.arrive"
>>>>>
>>>>> 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
>>>>> The main synchronization is done using a 'bar.red' instruction. This 
>>>>> reduces across all threads
>>>>> the condition (team->task_count != 0), to enable the task processing 
>>>>> down below if any thread
>>>>> created a task. (this bar.red usage required the need of the second 
>>>>> GCC patch in this series)
>>>>>
>>>>> This patch has been tested on x86_64/powerpc64le with nvptx offloading, 
>>>>> using libgomp, ovo, omptests,
>>>>> and sollve_vv testsuites, all without regressions. Also verified that the 
>>>>> SPEChpc 2021 521.miniswp_t
>>>>> and 534.hpgmgfv_t performance regressions that occurred in the GCC12 
>>>>> cycle has been restored to
>>>>> devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk?
>>>>>
>>>>> (also suggest backporting to GCC12 branch, if performance regression can 
>>>>> be considered a defect)
>>>>>
>>>>> Thanks,
>>>>> Chung-Lin
>>>>>
>>>>> libgomp/ChangeLog:
>>>>>
>>>>> 2022-09-21  Chung-Lin Tang  
>>>>>
>>>>>   * config/nvptx/bar.c (generation_to_barrier): Remove.
>>>>>   (futex_wait,futex_wake,do_spin,do_wait): Remove.
>>>>>   (GOMP_WAIT_H): Remove.
>>>>>   (#include "../linux/bar.c"): Remove.
>>>>>   (gomp_barrier_wait_end): New function.
>>>>>   (gomp_barrier_wait): Likewise.
>>>>>   (gomp_barrier_wait_last): Likewise.
>>>>>   (gomp_team_barrier_wait_end): Likewise.
>>>>>   (gomp_team_barrier_wait): Likewise.
>>>>>   (gomp_team_barrier_wait_final): Likewise.
>>>>>   (gomp_team_barrier_wait_cancel_end): Likewise.
>>>>>   (gomp_team_barrier_wait_cancel): Likewise.
>>>>>   (gomp_team_barrier_cancel): Likewise.
>>>>>   * config/nvptx/bar.h (gomp_team_barrier_wake): Remove
>>>>>   prototype, add new static inline function.
>>
>