Hi!

So I recently had reason to verify how 'static' variables behave in
OpenACC compute construct regions as well as OpenACC 'routine'.  Two
weeks ago I started writing a few testcases -- and today then wondered if
maybe there's something in the GCC archives about this.  And there is:
directly related are <https://gcc.gnu.org/PR84991> "[openacc] Misleading
error message for function static var in routine",
<https://gcc.gnu.org/PR84992> "[openacc] function static var in
parallel", and more generally <https://gcc.gnu.org/PR90779> "Fortran
array initialization in offload regions" that is discussed here.  (I had
taken part in at least some of these discussions, yet didn't directly
remember these now...  8-| Too much going on?)

Anyway:

On 2019-06-15T10:19:23+0200, Tom de Vries <tdevr...@suse.de> wrote:
> On 13-06-19 10:34, Jakub Jelinek wrote:
>> The OpenMP specification isn't clear on this, I'll work on getting that
>> clarified for 5.1, but the agreement on omp-lang has been that it should
>> work the way the patch implements it, static block scope variables inside of
>> #pragma omp target or #pragma omp declare target routines are handled as if
>> they have #pragma omp declare target to (variable).
>>
>> Bootstrapped/regtested on x86_64-linux and i686-linux, unfortunately it
>> regresses:
>> +FAIL: c-c++-common/goacc/routine-5.c  (test for errors, line 204)
>>
>> Thus, I'm not committing it right now and want to ask what should be done
>> for OpenACC.
>
> OpenACC 2.6 - 2.15.1. Routine Directive - Restrictions:
> ...
> In C and C++, function static variables are not supported in functions
> to which a routine directive applies.
> ...
> [ And text is still the same for 2.7. ]

..., and still in OpenACC 3.1.

But yes, that seems somewhat incomplete and/or inconsistent.  I've filed
<https://github.com/OpenACC/openacc-spec/issues/372> "C/C++ 'static'
variables" (only visible to members of the GitHub OpenACC organization).

I do agree that even if OpenACC ultimately doesn't want to support
certain cases of (?) 'static' variables, we still have to (and evidently
can) support 'static' for compiler-synthesized variables, per our own
desired semantics, which happen to match OpenMP's (as I understand this).

>> The patch uses & ORT_TARGET tests, so it affects both OpenMP
>> target region, and OpenACC parallel/kernels and both OpenMP and OpenACC
>> target routines.  Is it ok to do it that way and just adjust the routine-5.c
>> test, or shall it test (ctx->region_type & (ORT_TARGET | ORT_ACC)) ==
>> ORT_TARGET, i.e. only OpenMP and not OpenACC?  If so, there is still the
>> problem that gimplify_body.c does:
>>   if (flag_openacc || flag_openmp)
>>     {
>>       gcc_assert (gimplify_omp_ctxp == NULL);
>>       if (lookup_attribute ("omp declare target", DECL_ATTRIBUTES (fndecl)))
>>         gimplify_omp_ctxp = new_omp_context (ORT_TARGET);
>>     }
>> We'd need different attribute (or additional attribute) for OpenACC routines
>> and would need to use new_omp_context (cond ? ORT_TARGET : ORT_ACC_PARALLEL)
>> or similar to express OpenACC routines.

I'm fine to have this supported for GCC/OpenACC in the way that it
currently is, so no need to special-case that.  If OpenACC decides
otherwise, we'll then adjust.


(I have not reviewed the PR90779 code changes; it was sufficient for my
case to understand what I called GCC's observed behavior.)


I've now pushed "Add 'libgomp.oacc-c-c++-common/static-variable-1.c'
[PR84991, PR84992, PR90779]" to master branch in commit
ffa0ae6eeef3ad15d3f288283e4c477193052f1a, and releases/gcc-10 branch in
commit 60b589b5858fb8ad414583c6b493e0897f1bde5f, see attached.  (The
PR90779 code changes never got backported to GCC 9 and 8 release
branches.)

Also I've filed <https://gcc.gnu.org/PR100001> "[GCN offloading]
Occasional C++ 'libgomp.oacc-c-c++-common/static-variable-1.c' execution
failure".


Grüße
 Thomas


>> 2019-06-12  Jakub Jelinek  <ja...@redhat.com>
>>
>>      PR middle-end/90779
>>      * gimplify.c (gimplify_bind_expr): Add "omp declare target" attributes
>>      to static block scope variables inside of target region or target
>>      functions.
>>
>>      * testsuite/libgomp.c/pr90779.c: New test.
>>      * testsuite/libgomp.fortran/pr90779.f90: New test.
>>
>> --- gcc/gimplify.c.jj        2019-06-10 19:42:03.868959986 +0200
>> +++ gcc/gimplify.c   2019-06-12 13:00:18.765167777 +0200
>> @@ -1323,17 +1323,37 @@ gimplify_bind_expr (tree *expr_p, gimple
>>        struct gimplify_omp_ctx *ctx = gimplify_omp_ctxp;
>>
>>        /* Mark variable as local.  */
>> -      if (ctx && ctx->region_type != ORT_NONE && !DECL_EXTERNAL (t)
>> -          && (! DECL_SEEN_IN_BIND_EXPR_P (t)
>> -              || splay_tree_lookup (ctx->variables,
>> -                                    (splay_tree_key) t) == NULL))
>> +      if (ctx && ctx->region_type != ORT_NONE && !DECL_EXTERNAL (t))
>>          {
>> -          if (ctx->region_type == ORT_SIMD
>> -              && TREE_ADDRESSABLE (t)
>> -              && !TREE_STATIC (t))
>> -            omp_add_variable (ctx, t, GOVD_PRIVATE | GOVD_SEEN);
>> -          else
>> -            omp_add_variable (ctx, t, GOVD_LOCAL | GOVD_SEEN);
>> +          if (! DECL_SEEN_IN_BIND_EXPR_P (t)
>> +              || splay_tree_lookup (ctx->variables,
>> +                                    (splay_tree_key) t) == NULL)
>> +            {
>> +              if (ctx->region_type == ORT_SIMD
>> +                  && TREE_ADDRESSABLE (t)
>> +                  && !TREE_STATIC (t))
>> +                omp_add_variable (ctx, t, GOVD_PRIVATE | GOVD_SEEN);
>> +              else
>> +                omp_add_variable (ctx, t, GOVD_LOCAL | GOVD_SEEN);
>> +            }
>> +          /* Static locals inside of target construct or offloaded
>> +             routines need to be "omp declare target".  */
>> +          if (TREE_STATIC (t))
>> +            for (; ctx; ctx = ctx->outer_context)
>> +              if ((ctx->region_type & ORT_TARGET) != 0)
>> +                {
>> +                  if (!lookup_attribute ("omp declare target",
>> +                                         DECL_ATTRIBUTES (t)))
>> +                    {
>> +                      tree id = get_identifier ("omp declare target");
>> +                      DECL_ATTRIBUTES (t)
>> +                        = tree_cons (id, NULL_TREE, DECL_ATTRIBUTES (t));
>> +                      varpool_node *node = varpool_node::get (t);
>> +                      if (node)
>> +                        node->offloadable = 1;
>> +                    }
>> +                  break;
>> +                }
>>          }
>>
>>        DECL_SEEN_IN_BIND_EXPR_P (t) = 1;
>> --- libgomp/testsuite/libgomp.c/pr90779.c.jj 2019-06-12 13:01:57.081667587 
>> +0200
>> +++ libgomp/testsuite/libgomp.c/pr90779.c    2019-06-12 12:41:15.637730797 
>> +0200
>> @@ -0,0 +1,18 @@
>> +/* PR middle-end/90779 */
>> +
>> +extern void abort (void);
>> +
>> +int
>> +main ()
>> +{
>> +  int i, j;
>> +  for (i = 0; i < 2; ++i)
>> +    #pragma omp target map(from: j)
>> +    {
>> +      static int k = 5;
>> +      j = ++k;
>> +    }
>> +  if (j != 7)
>> +    abort ();
>> +  return 0;
>> +}
>> --- libgomp/testsuite/libgomp.fortran/pr90779.f90.jj 2019-06-12 
>> 12:43:17.891825811 +0200
>> +++ libgomp/testsuite/libgomp.fortran/pr90779.f90    2019-06-12 
>> 12:43:08.421973375 +0200
>> @@ -0,0 +1,12 @@
>> +! PR middle-end/90779
>> +
>> +program pr90779
>> +  implicit none
>> +  integer :: v(4), i
>> +
>> +  !$omp target map(from:v)
>> +    v(:) = (/ (i, i=1,4) /)
>> +  !$omp end target
>> +
>> +  if (any (v .ne. (/ (i, i=1,4) /))) stop 1
>> +end program
>>
>>      Jakub
>>


-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München 
Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank 
Thürauf
>From ffa0ae6eeef3ad15d3f288283e4c477193052f1a Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <tho...@codesourcery.com>
Date: Fri, 9 Apr 2021 16:03:32 +0200
Subject: [PATCH] Add 'libgomp.oacc-c-c++-common/static-variable-1.c' [PR84991,
 PR84992, PR90779]

	libgomp/
	PR middle-end/84991
	PR middle-end/84992
	PR middle-end/90779
	* testsuite/libgomp.oacc-c-c++-common/static-variable-1.c: New.
---
 .../static-variable-1.c                       | 460 ++++++++++++++++++
 1 file changed, 460 insertions(+)
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c
new file mode 100644
index 00000000000..1d415cdcf76
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c
@@ -0,0 +1,460 @@
+/* "Function scope" (top-level block scope) 'static' variables
+
+   ... inside OpenACC compute construct regions as well as OpenACC 'routine'.
+
+   This is to document/verify aspects of GCC's observed behavior, not
+   necessarily as it's (intended to be?) restricted by the OpenACC
+   specification.  See also PR84991, PR84992, PR90779 etc., and
+   <https://github.com/OpenACC/openacc-spec/issues/372> "C/C++ 'static'
+   variables" (only visible to members of the GitHub OpenACC organization).
+*/
+
+
+#undef NDEBUG
+#include <assert.h>
+#include <string.h>
+#include <openacc.h>
+#include <gomp-constants.h>
+
+
+#define IF_DEBUG if (0)
+
+
+/* Without explicit 'num_gangs'.  */
+
+static void t0_c(void)
+{
+  IF_DEBUG
+    __builtin_printf ("%s\n", __FUNCTION__);
+
+  const int i_limit = 11;
+  const int var_init = 16;
+
+  for (int i = 0; i < i_limit; ++i)
+    {
+      int result = 0;
+      int num_gangs_actual = -1;
+#pragma acc parallel \
+  reduction(max:num_gangs_actual) \
+  reduction(max:result)
+      {
+	num_gangs_actual = 1 + __builtin_goacc_parlevel_id(GOMP_DIM_GANG);
+
+	static int var = var_init;
+
+#pragma acc atomic capture
+	result = ++var;
+
+	/* Irrespective of the order in which the gang-redundant threads
+	   execute, 'var' has now been incremented 'num_gangs_actual' times, and
+	   the final value captured as 'result'.  */
+      }
+      /* Without an explicit 'num_gangs' clause GCC assigns 'num_gangs(1)'
+	 because it doesn't see any use of gang-level parallelism inside the
+	 region.  */
+      assert(num_gangs_actual == 1);
+      assert(result == var_init + num_gangs_actual * (1 + i));
+    }
+}
+
+
+/* Call a gang-level routine.  */
+
+static const int t0_r_var_init = 61;
+
+#pragma acc routine gang
+__attribute__((noinline))
+static int t0_r_r(void)
+{
+  static int var = t0_r_var_init;
+
+  int tmp;
+#pragma acc atomic capture
+  tmp = ++var;
+
+  return tmp;
+}
+
+static void t0_r(void)
+{
+  IF_DEBUG
+    __builtin_printf ("%s\n", __FUNCTION__);
+
+  const int i_limit = 11;
+
+  for (int i = 0; i < i_limit; ++i)
+    {
+      int result = 0;
+      int num_gangs_actual = -1;
+#pragma acc parallel \
+  reduction(max:num_gangs_actual) \
+  reduction(max:result)
+      {
+	num_gangs_actual = 1 + __builtin_goacc_parlevel_id(GOMP_DIM_GANG);
+
+	result = t0_r_r();
+
+	/* Irrespective of the order in which the gang-redundant threads
+	   execute, 'var' has now been incremented 'num_gangs_actual' times, and
+	   the final value captured as 'result'.  */
+      }
+      /* The number of gangs selected by the implemention ought to but must not
+	 be bigger than one.  */
+      IF_DEBUG
+	__builtin_printf ("%d: num_gangs_actual: %d\n", i, num_gangs_actual);
+      assert(num_gangs_actual >= 1);
+      assert(result == t0_r_var_init + num_gangs_actual * (1 + i));
+    }
+}
+
+
+/* Explicit 'num_gangs'.  */
+
+static void t1_c(void)
+{
+  IF_DEBUG
+    __builtin_printf ("%s\n", __FUNCTION__);
+
+  const int i_limit = 22;
+  const int num_gangs_request = 444;
+  const int var_init = 5;
+
+  for (int i = 0; i < i_limit; ++i)
+    {
+      int result = 0;
+      int num_gangs_actual = -1;
+#pragma acc parallel \
+  num_gangs(num_gangs_request) \
+  reduction(max:num_gangs_actual) \
+  reduction(max:result)
+      {
+	num_gangs_actual = 1 + __builtin_goacc_parlevel_id(GOMP_DIM_GANG);
+
+	static int var = var_init;
+
+#pragma acc atomic capture
+	result = ++var;
+
+	/* Irrespective of the order in which the gang-redundant threads
+	   execute, 'var' has now been incremented 'num_gangs_actual' times, and
+	   the final value captured as 'result'.  */
+      }
+      if (acc_get_device_type() == acc_device_host)
+	assert(num_gangs_actual == 1);
+      else
+	assert(num_gangs_actual == num_gangs_request);
+      assert(result == var_init + num_gangs_actual * (1 + i));
+    }
+}
+
+
+/* Check the same routine called from two compute constructs.  */
+
+static const int t1_r2_var_init = 166;
+
+#pragma acc routine gang
+__attribute__((noinline))
+static int t1_r2_r(void)
+{
+  static int var = t1_r2_var_init;
+
+  int tmp;
+#pragma acc atomic capture
+  tmp = ++var;
+
+  return tmp;
+}
+
+static void t1_r2(void)
+{
+  IF_DEBUG
+    __builtin_printf ("%s\n", __FUNCTION__);
+
+  const int i_limit = 71;
+  /* The checking assumes the same 'num_gangs' for all compute constructs.  */
+  const int num_gangs_request = 333;
+  int num_gangs_actual = -1;
+  if (acc_get_device_type() == acc_device_host)
+    num_gangs_actual = 1;
+  else
+    {
+      /* We're assuming that the implementation is able to accomodate the
+	 'num_gangs' requested (which really ought to be true for
+	 'num_gangs').  */
+      num_gangs_actual = num_gangs_request;
+    }
+
+  for (int i = 0; i < i_limit; ++i)
+    {
+      int result_1 = 0;
+#pragma acc parallel \
+  num_gangs(num_gangs_request) \
+  reduction(max:result_1)
+      {
+	result_1 = t1_r2_r();
+
+	/* Irrespective of the order in which the gang-redundant threads
+	   execute, 'var' has now been incremented 'num_gangs_actual' times, and
+	   the final value captured as 'result_1'.  */
+      }
+      IF_DEBUG
+	__builtin_printf ("%d: result_1: %d\n", i, result_1);
+      assert(result_1 == t1_r2_var_init + num_gangs_actual * (1 + (i * 3 + 0)));
+
+      int result_2 = 0;
+#pragma acc parallel \
+  num_gangs(num_gangs_request) \
+  reduction(max:result_2)
+      {
+	result_2 = t1_r2_r() + t1_r2_r();
+
+	/* Irrespective of the order in which the gang-redundant threads
+	   execute, 'var' has now been incremented '2 * num_gangs_actual' times.
+	   However, the order of the two 't1_r2_r' function calls is not
+	   synchronized (between different gang-redundant threads).  We thus
+	   cannot verify the actual 'result_2' values in this case.  */
+      }
+      IF_DEBUG
+	__builtin_printf ("%d: result_2: %d\n", i, result_2);
+      if (num_gangs_actual == 1)
+	/* Per the rationale above, only in this case we can check the actual
+	   result.  */
+	assert(result_2 == (t1_r2_var_init + num_gangs_actual * (1 + (i * 3 + 1))
+			    + t1_r2_var_init + num_gangs_actual * (1 + (i * 3 + 2))));
+      /* But we can generally check low and high limits.  */
+      {
+	/* Must be bigger than '2 * result_1'.  */
+	int c = 2 * result_1;
+	IF_DEBUG
+	  __builtin_printf ("  > %d\n", c);
+	assert(result_2 > c);
+      }
+      {
+	/* ..., but limited by the base value for next 'i'.  */
+	int c = 2 * (t1_r2_var_init + num_gangs_actual * (0 + ((i + 1) * 3 + 0)));
+	IF_DEBUG
+	  __builtin_printf ("  < %d\n", c);
+	assert(result_2 < c);
+      }
+    }
+}
+
+
+/* Asynchronous execution.  */
+
+static const int t2_var_init_2 = -55;
+
+#pragma acc routine gang
+__attribute__((noinline))
+static int t2_r(void)
+{
+  static int var = t2_var_init_2;
+
+  int tmp;
+#pragma acc atomic capture
+  tmp = ++var;
+
+  return tmp;
+}
+
+static void t2(void)
+{
+  IF_DEBUG
+    __builtin_printf ("%s\n", __FUNCTION__);
+
+  const int i_limit = 12;
+  const int num_gangs_request_1 = 14;
+  const int var_init_1 = 5;
+  int results_1[i_limit][num_gangs_request_1];
+  memset (results_1, 0, sizeof results_1);
+  const int num_gangs_request_2 = 5;
+  int results_2[i_limit][num_gangs_request_2];
+  memset (results_2, 0, sizeof results_2);
+  const int num_gangs_request_3 = 34;
+  const int var_init_3 = 1250;
+  int results_3[i_limit][num_gangs_request_3];
+  memset (results_3, 0, sizeof results_3);
+
+#pragma acc data \
+  copy(results_1, results_2, results_3)
+  {
+    for (int i = 0; i < i_limit; ++i)
+      {
+	/* The following 'async' clauses effect asynchronous execution, but
+	   using the same async-argument for each compute construct implies that
+	   the respective compute constructs' execution is synchronized with
+	   itself, meaning that all 'i = 0' execution has finished (on the
+	   device) before 'i = 1' is started (on the device), etc.  */
+
+#pragma acc parallel \
+  present(results_1) \
+  num_gangs(num_gangs_request_1) \
+  async(1)
+	{
+	  static int var = var_init_1;
+
+	  int tmp;
+#pragma acc atomic capture
+	  tmp = ++var;
+
+	  results_1[i][__builtin_goacc_parlevel_id(GOMP_DIM_GANG)] += tmp;
+	}
+
+#pragma acc parallel \
+  present(results_2) \
+  num_gangs(num_gangs_request_2) \
+  async(2)
+	{
+	  results_2[i][__builtin_goacc_parlevel_id(GOMP_DIM_GANG)] += t2_r();
+	}
+
+#pragma acc parallel \
+  present(results_3) \
+  num_gangs(num_gangs_request_3) \
+  async(3)
+	{
+	  static int var = var_init_3;
+
+	  int tmp;
+#pragma acc atomic capture
+	  tmp = ++var;
+
+	  results_3[i][__builtin_goacc_parlevel_id(GOMP_DIM_GANG)] += tmp;
+	}
+      }
+#pragma acc wait
+  }
+  int num_gangs_actual_1;
+  int num_gangs_actual_2;
+  int num_gangs_actual_3;
+  if (acc_get_device_type() == acc_device_host)
+    {
+      num_gangs_actual_1 = 1;
+      num_gangs_actual_2 = 1;
+      num_gangs_actual_3 = 1;
+    }
+  else
+    {
+      /* We're assuming that the implementation is able to accomodate the
+	 'num_gangs' requested (which really ought to be true for
+	 'num_gangs').  */
+      num_gangs_actual_1 = num_gangs_request_1;
+      num_gangs_actual_2 = num_gangs_request_2;
+      num_gangs_actual_3 = num_gangs_request_3;
+    }
+
+  /* For 'i = 0', 'results_*[i][0..num_gangs_actual_*]' are expected to each
+     contain one value of '(1 + var_init_*)..(var_init_* + num_gangs_actual_*)',
+     and so on for increasing 'i'.  Their order however is unspecified due to
+     the gang-redundant execution.  (Thus checking that their sums match.)  */
+
+  int result_1 = 0;
+  int result_2 = 0;
+  int result_3 = 0;
+  for (int i = 0; i < i_limit; ++i)
+    {
+      int result_1_ = 0;
+      for (int g = 0; g < num_gangs_actual_1; ++g)
+	{
+	  IF_DEBUG
+	    __builtin_printf ("results_1[%d][%d]: %d\n", i, g, results_1[i][g]);
+	  result_1_ += results_1[i][g];
+	}
+      IF_DEBUG
+	__builtin_printf ("%d result_1_: %d\n", i, result_1_);
+      assert (result_1_ == (((var_init_1 + num_gangs_actual_1 * (1 + i)) * (1 + var_init_1 + num_gangs_actual_1 * (1 + i)) / 2)
+			    - ((var_init_1 + num_gangs_actual_1 * (0 + i)) * (1 + var_init_1 + num_gangs_actual_1 * (0 + i)) / 2)));
+      result_1 += result_1_;
+
+      int result_2_ = 0;
+      for (int g = 0; g < num_gangs_actual_2; ++g)
+	{
+	  IF_DEBUG
+	    __builtin_printf ("results_2[%d][%d]: %d\n", i, g, results_2[i][g]);
+	  result_2_ += results_2[i][g];
+	}
+      IF_DEBUG
+	__builtin_printf ("%d result_2_: %d\n", i, result_2_);
+      assert (result_2_ == (((t2_var_init_2 + num_gangs_actual_2 * (1 + i)) * (1 + t2_var_init_2 + num_gangs_actual_2 * (1 + i)) / 2)
+			    - ((t2_var_init_2 + num_gangs_actual_2 * (0 + i)) * (1 + t2_var_init_2 + num_gangs_actual_2 * (0 + i)) / 2)));
+      result_2 += result_2_;
+
+      int result_3_ = 0;
+      for (int g = 0; g < num_gangs_actual_3; ++g)
+	{
+	  IF_DEBUG
+	    __builtin_printf ("results_3[%d][%d]: %d\n", i, g, results_3[i][g]);
+	  result_3_ += results_3[i][g];
+	}
+      IF_DEBUG
+	__builtin_printf ("%d result_3_: %d\n", i, result_3_);
+      assert (result_3_ == (((var_init_3 + num_gangs_actual_3 * (1 + i)) * (1 + var_init_3 + num_gangs_actual_3 * (1 + i)) / 2)
+			    - ((var_init_3 + num_gangs_actual_3 * (0 + i)) * (1 + var_init_3 + num_gangs_actual_3 * (0 + i)) / 2)));
+      result_3 += result_3_;
+    }
+  IF_DEBUG
+    __builtin_printf ("result_1: %d\n", result_1);
+  assert (result_1 == (((var_init_1 + num_gangs_actual_1 * i_limit) * (1 + var_init_1 + num_gangs_actual_1 * i_limit) / 2)
+		       - (var_init_1 * (var_init_1 + 1) / 2)));
+  IF_DEBUG
+    __builtin_printf ("result_2: %d\n", result_2);
+  assert (result_2 == (((t2_var_init_2 + num_gangs_actual_2 * i_limit) * (1 + t2_var_init_2 + num_gangs_actual_2 * i_limit) / 2)
+		       - (t2_var_init_2 * (t2_var_init_2 + 1) / 2)));
+  IF_DEBUG
+    __builtin_printf ("result_3: %d\n", result_3);
+  assert (result_3 == (((var_init_3 + num_gangs_actual_3 * i_limit) * (1 + var_init_3 + num_gangs_actual_3 * i_limit) / 2)
+		       - (var_init_3 * (var_init_3 + 1) / 2)));
+}
+
+
+#pragma acc routine seq
+__attribute__((noinline))
+static int pr84991_1_r_s(int n)
+{
+  static const int test[] = {1,2,3,4};
+  return test[n];
+}
+
+static void pr84991_1(void)
+{
+  int n[1];
+  n[0] = 3;
+#pragma acc parallel copy(n)
+  {
+    n[0] = pr84991_1_r_s(n[0]);
+  }
+  assert(n[0] == 4);
+}
+
+
+static void pr84992_1(void)
+{
+  int n[1];
+  n[0] = 3;
+#pragma acc parallel copy(n)
+  {
+    static const int test[] = {1,2,3,4};
+    n[0] = test[n[0]];
+  }
+  assert(n[0] == 4);
+}
+
+
+int main(void)
+{
+  t0_c();
+
+  t0_r();
+
+  t1_c();
+
+  t1_r2();
+
+  t2();
+
+  pr84991_1();
+
+  pr84992_1();
+
+  return 0;
+}
-- 
2.30.2

>From 60b589b5858fb8ad414583c6b493e0897f1bde5f Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <tho...@codesourcery.com>
Date: Fri, 9 Apr 2021 16:03:32 +0200
Subject: [PATCH] Add 'libgomp.oacc-c-c++-common/static-variable-1.c' [PR84991,
 PR84992, PR90779]

	libgomp/
	PR middle-end/84991
	PR middle-end/84992
	PR middle-end/90779
	* testsuite/libgomp.oacc-c-c++-common/static-variable-1.c: New.

(cherry picked from commit ffa0ae6eeef3ad15d3f288283e4c477193052f1a)
---
 .../static-variable-1.c                       | 460 ++++++++++++++++++
 1 file changed, 460 insertions(+)
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c
new file mode 100644
index 00000000000..1d415cdcf76
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c
@@ -0,0 +1,460 @@
+/* "Function scope" (top-level block scope) 'static' variables
+
+   ... inside OpenACC compute construct regions as well as OpenACC 'routine'.
+
+   This is to document/verify aspects of GCC's observed behavior, not
+   necessarily as it's (intended to be?) restricted by the OpenACC
+   specification.  See also PR84991, PR84992, PR90779 etc., and
+   <https://github.com/OpenACC/openacc-spec/issues/372> "C/C++ 'static'
+   variables" (only visible to members of the GitHub OpenACC organization).
+*/
+
+
+#undef NDEBUG
+#include <assert.h>
+#include <string.h>
+#include <openacc.h>
+#include <gomp-constants.h>
+
+
+#define IF_DEBUG if (0)
+
+
+/* Without explicit 'num_gangs'.  */
+
+static void t0_c(void)
+{
+  IF_DEBUG
+    __builtin_printf ("%s\n", __FUNCTION__);
+
+  const int i_limit = 11;
+  const int var_init = 16;
+
+  for (int i = 0; i < i_limit; ++i)
+    {
+      int result = 0;
+      int num_gangs_actual = -1;
+#pragma acc parallel \
+  reduction(max:num_gangs_actual) \
+  reduction(max:result)
+      {
+	num_gangs_actual = 1 + __builtin_goacc_parlevel_id(GOMP_DIM_GANG);
+
+	static int var = var_init;
+
+#pragma acc atomic capture
+	result = ++var;
+
+	/* Irrespective of the order in which the gang-redundant threads
+	   execute, 'var' has now been incremented 'num_gangs_actual' times, and
+	   the final value captured as 'result'.  */
+      }
+      /* Without an explicit 'num_gangs' clause GCC assigns 'num_gangs(1)'
+	 because it doesn't see any use of gang-level parallelism inside the
+	 region.  */
+      assert(num_gangs_actual == 1);
+      assert(result == var_init + num_gangs_actual * (1 + i));
+    }
+}
+
+
+/* Call a gang-level routine.  */
+
+static const int t0_r_var_init = 61;
+
+#pragma acc routine gang
+__attribute__((noinline))
+static int t0_r_r(void)
+{
+  static int var = t0_r_var_init;
+
+  int tmp;
+#pragma acc atomic capture
+  tmp = ++var;
+
+  return tmp;
+}
+
+static void t0_r(void)
+{
+  IF_DEBUG
+    __builtin_printf ("%s\n", __FUNCTION__);
+
+  const int i_limit = 11;
+
+  for (int i = 0; i < i_limit; ++i)
+    {
+      int result = 0;
+      int num_gangs_actual = -1;
+#pragma acc parallel \
+  reduction(max:num_gangs_actual) \
+  reduction(max:result)
+      {
+	num_gangs_actual = 1 + __builtin_goacc_parlevel_id(GOMP_DIM_GANG);
+
+	result = t0_r_r();
+
+	/* Irrespective of the order in which the gang-redundant threads
+	   execute, 'var' has now been incremented 'num_gangs_actual' times, and
+	   the final value captured as 'result'.  */
+      }
+      /* The number of gangs selected by the implemention ought to but must not
+	 be bigger than one.  */
+      IF_DEBUG
+	__builtin_printf ("%d: num_gangs_actual: %d\n", i, num_gangs_actual);
+      assert(num_gangs_actual >= 1);
+      assert(result == t0_r_var_init + num_gangs_actual * (1 + i));
+    }
+}
+
+
+/* Explicit 'num_gangs'.  */
+
+static void t1_c(void)
+{
+  IF_DEBUG
+    __builtin_printf ("%s\n", __FUNCTION__);
+
+  const int i_limit = 22;
+  const int num_gangs_request = 444;
+  const int var_init = 5;
+
+  for (int i = 0; i < i_limit; ++i)
+    {
+      int result = 0;
+      int num_gangs_actual = -1;
+#pragma acc parallel \
+  num_gangs(num_gangs_request) \
+  reduction(max:num_gangs_actual) \
+  reduction(max:result)
+      {
+	num_gangs_actual = 1 + __builtin_goacc_parlevel_id(GOMP_DIM_GANG);
+
+	static int var = var_init;
+
+#pragma acc atomic capture
+	result = ++var;
+
+	/* Irrespective of the order in which the gang-redundant threads
+	   execute, 'var' has now been incremented 'num_gangs_actual' times, and
+	   the final value captured as 'result'.  */
+      }
+      if (acc_get_device_type() == acc_device_host)
+	assert(num_gangs_actual == 1);
+      else
+	assert(num_gangs_actual == num_gangs_request);
+      assert(result == var_init + num_gangs_actual * (1 + i));
+    }
+}
+
+
+/* Check the same routine called from two compute constructs.  */
+
+static const int t1_r2_var_init = 166;
+
+#pragma acc routine gang
+__attribute__((noinline))
+static int t1_r2_r(void)
+{
+  static int var = t1_r2_var_init;
+
+  int tmp;
+#pragma acc atomic capture
+  tmp = ++var;
+
+  return tmp;
+}
+
+static void t1_r2(void)
+{
+  IF_DEBUG
+    __builtin_printf ("%s\n", __FUNCTION__);
+
+  const int i_limit = 71;
+  /* The checking assumes the same 'num_gangs' for all compute constructs.  */
+  const int num_gangs_request = 333;
+  int num_gangs_actual = -1;
+  if (acc_get_device_type() == acc_device_host)
+    num_gangs_actual = 1;
+  else
+    {
+      /* We're assuming that the implementation is able to accomodate the
+	 'num_gangs' requested (which really ought to be true for
+	 'num_gangs').  */
+      num_gangs_actual = num_gangs_request;
+    }
+
+  for (int i = 0; i < i_limit; ++i)
+    {
+      int result_1 = 0;
+#pragma acc parallel \
+  num_gangs(num_gangs_request) \
+  reduction(max:result_1)
+      {
+	result_1 = t1_r2_r();
+
+	/* Irrespective of the order in which the gang-redundant threads
+	   execute, 'var' has now been incremented 'num_gangs_actual' times, and
+	   the final value captured as 'result_1'.  */
+      }
+      IF_DEBUG
+	__builtin_printf ("%d: result_1: %d\n", i, result_1);
+      assert(result_1 == t1_r2_var_init + num_gangs_actual * (1 + (i * 3 + 0)));
+
+      int result_2 = 0;
+#pragma acc parallel \
+  num_gangs(num_gangs_request) \
+  reduction(max:result_2)
+      {
+	result_2 = t1_r2_r() + t1_r2_r();
+
+	/* Irrespective of the order in which the gang-redundant threads
+	   execute, 'var' has now been incremented '2 * num_gangs_actual' times.
+	   However, the order of the two 't1_r2_r' function calls is not
+	   synchronized (between different gang-redundant threads).  We thus
+	   cannot verify the actual 'result_2' values in this case.  */
+      }
+      IF_DEBUG
+	__builtin_printf ("%d: result_2: %d\n", i, result_2);
+      if (num_gangs_actual == 1)
+	/* Per the rationale above, only in this case we can check the actual
+	   result.  */
+	assert(result_2 == (t1_r2_var_init + num_gangs_actual * (1 + (i * 3 + 1))
+			    + t1_r2_var_init + num_gangs_actual * (1 + (i * 3 + 2))));
+      /* But we can generally check low and high limits.  */
+      {
+	/* Must be bigger than '2 * result_1'.  */
+	int c = 2 * result_1;
+	IF_DEBUG
+	  __builtin_printf ("  > %d\n", c);
+	assert(result_2 > c);
+      }
+      {
+	/* ..., but limited by the base value for next 'i'.  */
+	int c = 2 * (t1_r2_var_init + num_gangs_actual * (0 + ((i + 1) * 3 + 0)));
+	IF_DEBUG
+	  __builtin_printf ("  < %d\n", c);
+	assert(result_2 < c);
+      }
+    }
+}
+
+
+/* Asynchronous execution.  */
+
+static const int t2_var_init_2 = -55;
+
+#pragma acc routine gang
+__attribute__((noinline))
+static int t2_r(void)
+{
+  static int var = t2_var_init_2;
+
+  int tmp;
+#pragma acc atomic capture
+  tmp = ++var;
+
+  return tmp;
+}
+
+static void t2(void)
+{
+  IF_DEBUG
+    __builtin_printf ("%s\n", __FUNCTION__);
+
+  const int i_limit = 12;
+  const int num_gangs_request_1 = 14;
+  const int var_init_1 = 5;
+  int results_1[i_limit][num_gangs_request_1];
+  memset (results_1, 0, sizeof results_1);
+  const int num_gangs_request_2 = 5;
+  int results_2[i_limit][num_gangs_request_2];
+  memset (results_2, 0, sizeof results_2);
+  const int num_gangs_request_3 = 34;
+  const int var_init_3 = 1250;
+  int results_3[i_limit][num_gangs_request_3];
+  memset (results_3, 0, sizeof results_3);
+
+#pragma acc data \
+  copy(results_1, results_2, results_3)
+  {
+    for (int i = 0; i < i_limit; ++i)
+      {
+	/* The following 'async' clauses effect asynchronous execution, but
+	   using the same async-argument for each compute construct implies that
+	   the respective compute constructs' execution is synchronized with
+	   itself, meaning that all 'i = 0' execution has finished (on the
+	   device) before 'i = 1' is started (on the device), etc.  */
+
+#pragma acc parallel \
+  present(results_1) \
+  num_gangs(num_gangs_request_1) \
+  async(1)
+	{
+	  static int var = var_init_1;
+
+	  int tmp;
+#pragma acc atomic capture
+	  tmp = ++var;
+
+	  results_1[i][__builtin_goacc_parlevel_id(GOMP_DIM_GANG)] += tmp;
+	}
+
+#pragma acc parallel \
+  present(results_2) \
+  num_gangs(num_gangs_request_2) \
+  async(2)
+	{
+	  results_2[i][__builtin_goacc_parlevel_id(GOMP_DIM_GANG)] += t2_r();
+	}
+
+#pragma acc parallel \
+  present(results_3) \
+  num_gangs(num_gangs_request_3) \
+  async(3)
+	{
+	  static int var = var_init_3;
+
+	  int tmp;
+#pragma acc atomic capture
+	  tmp = ++var;
+
+	  results_3[i][__builtin_goacc_parlevel_id(GOMP_DIM_GANG)] += tmp;
+	}
+      }
+#pragma acc wait
+  }
+  int num_gangs_actual_1;
+  int num_gangs_actual_2;
+  int num_gangs_actual_3;
+  if (acc_get_device_type() == acc_device_host)
+    {
+      num_gangs_actual_1 = 1;
+      num_gangs_actual_2 = 1;
+      num_gangs_actual_3 = 1;
+    }
+  else
+    {
+      /* We're assuming that the implementation is able to accomodate the
+	 'num_gangs' requested (which really ought to be true for
+	 'num_gangs').  */
+      num_gangs_actual_1 = num_gangs_request_1;
+      num_gangs_actual_2 = num_gangs_request_2;
+      num_gangs_actual_3 = num_gangs_request_3;
+    }
+
+  /* For 'i = 0', 'results_*[i][0..num_gangs_actual_*]' are expected to each
+     contain one value of '(1 + var_init_*)..(var_init_* + num_gangs_actual_*)',
+     and so on for increasing 'i'.  Their order however is unspecified due to
+     the gang-redundant execution.  (Thus checking that their sums match.)  */
+
+  int result_1 = 0;
+  int result_2 = 0;
+  int result_3 = 0;
+  for (int i = 0; i < i_limit; ++i)
+    {
+      int result_1_ = 0;
+      for (int g = 0; g < num_gangs_actual_1; ++g)
+	{
+	  IF_DEBUG
+	    __builtin_printf ("results_1[%d][%d]: %d\n", i, g, results_1[i][g]);
+	  result_1_ += results_1[i][g];
+	}
+      IF_DEBUG
+	__builtin_printf ("%d result_1_: %d\n", i, result_1_);
+      assert (result_1_ == (((var_init_1 + num_gangs_actual_1 * (1 + i)) * (1 + var_init_1 + num_gangs_actual_1 * (1 + i)) / 2)
+			    - ((var_init_1 + num_gangs_actual_1 * (0 + i)) * (1 + var_init_1 + num_gangs_actual_1 * (0 + i)) / 2)));
+      result_1 += result_1_;
+
+      int result_2_ = 0;
+      for (int g = 0; g < num_gangs_actual_2; ++g)
+	{
+	  IF_DEBUG
+	    __builtin_printf ("results_2[%d][%d]: %d\n", i, g, results_2[i][g]);
+	  result_2_ += results_2[i][g];
+	}
+      IF_DEBUG
+	__builtin_printf ("%d result_2_: %d\n", i, result_2_);
+      assert (result_2_ == (((t2_var_init_2 + num_gangs_actual_2 * (1 + i)) * (1 + t2_var_init_2 + num_gangs_actual_2 * (1 + i)) / 2)
+			    - ((t2_var_init_2 + num_gangs_actual_2 * (0 + i)) * (1 + t2_var_init_2 + num_gangs_actual_2 * (0 + i)) / 2)));
+      result_2 += result_2_;
+
+      int result_3_ = 0;
+      for (int g = 0; g < num_gangs_actual_3; ++g)
+	{
+	  IF_DEBUG
+	    __builtin_printf ("results_3[%d][%d]: %d\n", i, g, results_3[i][g]);
+	  result_3_ += results_3[i][g];
+	}
+      IF_DEBUG
+	__builtin_printf ("%d result_3_: %d\n", i, result_3_);
+      assert (result_3_ == (((var_init_3 + num_gangs_actual_3 * (1 + i)) * (1 + var_init_3 + num_gangs_actual_3 * (1 + i)) / 2)
+			    - ((var_init_3 + num_gangs_actual_3 * (0 + i)) * (1 + var_init_3 + num_gangs_actual_3 * (0 + i)) / 2)));
+      result_3 += result_3_;
+    }
+  IF_DEBUG
+    __builtin_printf ("result_1: %d\n", result_1);
+  assert (result_1 == (((var_init_1 + num_gangs_actual_1 * i_limit) * (1 + var_init_1 + num_gangs_actual_1 * i_limit) / 2)
+		       - (var_init_1 * (var_init_1 + 1) / 2)));
+  IF_DEBUG
+    __builtin_printf ("result_2: %d\n", result_2);
+  assert (result_2 == (((t2_var_init_2 + num_gangs_actual_2 * i_limit) * (1 + t2_var_init_2 + num_gangs_actual_2 * i_limit) / 2)
+		       - (t2_var_init_2 * (t2_var_init_2 + 1) / 2)));
+  IF_DEBUG
+    __builtin_printf ("result_3: %d\n", result_3);
+  assert (result_3 == (((var_init_3 + num_gangs_actual_3 * i_limit) * (1 + var_init_3 + num_gangs_actual_3 * i_limit) / 2)
+		       - (var_init_3 * (var_init_3 + 1) / 2)));
+}
+
+
+#pragma acc routine seq
+__attribute__((noinline))
+static int pr84991_1_r_s(int n)
+{
+  static const int test[] = {1,2,3,4};
+  return test[n];
+}
+
+static void pr84991_1(void)
+{
+  int n[1];
+  n[0] = 3;
+#pragma acc parallel copy(n)
+  {
+    n[0] = pr84991_1_r_s(n[0]);
+  }
+  assert(n[0] == 4);
+}
+
+
+static void pr84992_1(void)
+{
+  int n[1];
+  n[0] = 3;
+#pragma acc parallel copy(n)
+  {
+    static const int test[] = {1,2,3,4};
+    n[0] = test[n[0]];
+  }
+  assert(n[0] == 4);
+}
+
+
+int main(void)
+{
+  t0_c();
+
+  t0_r();
+
+  t1_c();
+
+  t1_r2();
+
+  t2();
+
+  pr84991_1();
+
+  pr84992_1();
+
+  return 0;
+}
-- 
2.30.2

Reply via email to