
A couple of years ago I posted a patch to this same code solving a performance problem with x86_64/amdgcn offloading:
https://patchwork.sourceware.org/project/gcc/patch/0e1a740e-46d5-ebfa-36f4-9a069ddf8...@codesourcery.com/

At that time, the patch was rejected, and I didn't have time to make the requested edits due to higher priorities.

Coincidentally, I just started working on this again a week or so ago. I now have a patch series nearly ready to go (I was just about to start adding some testcases), but on reading the list this morning I found that it conflicts with your patch already posted.

Some of the patch is nearly identical (such as a new IFN with the obvious not-an-expander), but not quite. Instead of adding a whole new pass I have simply enabled ompdevlow for this case. Also, I didn't need to do anything about SIMT for my use case.

BabelStream "dot" benchmark (gfx90a):

Baseline:   364541 MBytes/sec
Your patch: 354892 MBytes/sec -- same, within noise
My patches: 574802 MBytes/sec -- 1.6x speedup[*]

So, your patch doesn't fix my problem, and I imagine my patch doesn't fix your problem (because max_vf remains "1" when offloading to SIMT devices).

Only patch 1/3 is actually needed to fix my benchmark. The other two patches handle the remaining cases with increasing thoroughness.

To do this properly I think we need to delay the SIMT cases as well, so as not to hurt AArch64 hosts, but I still need to figure out why your solution is not working for me.

Andrew


[*] My original post claimed a 10x speedup, but that was when amdgcn only had V64 vector modes, so setting "max_vf = 16" resulted in total vectorizer failure. Now that amdgcn has V16 modes the baseline result is much better, but max_vf really does need to be 64.
From 69db90d5639c4ce082136eee032ab63a61a32035 Mon Sep 17 00:00:00 2001
From: Andrew Stubbs <a...@baylibre.com>
Date: Mon, 21 Oct 2024 12:29:54 +0000
Subject: [PATCH 1/3] openmp: Tune omp_max_vf for offload targets

If requested, return the vectorization factor appropriate for the offload
device, if there is one.

This change gives a significant speedup in the BabelStream "dot" benchmark on
amdgcn.

The omp_adjust_chunk_size use case passes "false" for now, but I intend to
change that in a follow-up patch.

Note that NVPTX SIMT offload does not use this code-path.

gcc/ChangeLog:

	* gimple-loop-versioning.cc (loop_versioning::loop_versioning): Set
	omp_max_vf to offload == false.
	* omp-expand.cc (omp_adjust_chunk_size): Likewise.
	* omp-general.cc (omp_max_vf): Add "offload" parameter, and detect
	amdgcn offload devices.
	* omp-general.h (omp_max_vf): Likewise.
	* omp-low.cc (lower_rec_simd_input_clauses): Pass offload state to
	omp_max_vf.
---
 gcc/gimple-loop-versioning.cc |  2 +-
 gcc/omp-expand.cc             |  2 +-
 gcc/omp-general.cc            | 17 +++++++++++++++--
 gcc/omp-general.h             |  2 +-
 gcc/omp-low.cc                |  3 ++-
 5 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/gcc/gimple-loop-versioning.cc b/gcc/gimple-loop-versioning.cc
index 107b0020024..2968c929d04 100644
--- a/gcc/gimple-loop-versioning.cc
+++ b/gcc/gimple-loop-versioning.cc
@@ -554,7 +554,7 @@ loop_versioning::loop_versioning (function *fn)
      handled efficiently by scalar code.  omp_max_vf calculates the
      maximum number of bytes in a vector, when such a value is relevant
      to loop optimization.  */
-  m_maximum_scale = estimated_poly_value (omp_max_vf ());
+  m_maximum_scale = estimated_poly_value (omp_max_vf (false));
   m_maximum_scale = MAX (m_maximum_scale, MAX_FIXED_MODE_SIZE);
 }
 
diff --git a/gcc/omp-expand.cc b/gcc/omp-expand.cc
index b0b4ddf5dbc..907fd46a5b2 100644
--- a/gcc/omp-expand.cc
+++ b/gcc/omp-expand.cc
@@ -212,7 +212,7 @@ omp_adjust_chunk_size (tree chunk_size, bool simd_schedule)
   if (!simd_schedule || integer_zerop (chunk_size))
     return chunk_size;
 
-  poly_uint64 vf = omp_max_vf ();
+  poly_uint64 vf = omp_max_vf (false);
   if (known_eq (vf, 1U))
     return chunk_size;
 
diff --git a/gcc/omp-general.cc b/gcc/omp-general.cc
index f74b9bf5e96..223f6037270 100644
--- a/gcc/omp-general.cc
+++ b/gcc/omp-general.cc
@@ -987,10 +987,11 @@ find_combined_omp_for (tree *tp, int *walk_subtrees, void *data)
   return NULL_TREE;
 }
 
-/* Return maximum possible vectorization factor for the target.  */
+/* Return maximum possible vectorization factor for the target, or for
+   the OpenMP offload target if one exists.  */
 
 poly_uint64
-omp_max_vf (void)
+omp_max_vf (bool offload)
 {
   if (!optimize
       || optimize_debug
@@ -999,6 +1000,18 @@ omp_max_vf (void)
 	  && OPTION_SET_P (flag_tree_loop_vectorize)))
     return 1;
 
+  if (ENABLE_OFFLOADING && offload)
+    {
+      for (const char *c = getenv ("OFFLOAD_TARGET_NAMES"); c;)
+	{
+	  if (startswith (c, "amdgcn"))
+	    return 64;
+	  else if ((c = strchr (c, ':')))
+	    c++;
+	}
+      /* Otherwise, fall through to host VF.  */
+    }
+
   auto_vector_modes modes;
   targetm.vectorize.autovectorize_vector_modes (&modes, true);
   if (!modes.is_empty ())
diff --git a/gcc/omp-general.h b/gcc/omp-general.h
index f3778131626..70f78d2055b 100644
--- a/gcc/omp-general.h
+++ b/gcc/omp-general.h
@@ -162,7 +162,7 @@ extern void omp_extract_for_data (gomp_for *for_stmt, struct omp_for_data *fd,
 				  struct omp_for_data_loop *loops);
 extern gimple *omp_build_barrier (tree lhs);
 extern tree find_combined_omp_for (tree *, int *, void *);
-extern poly_uint64 omp_max_vf (void);
+extern poly_uint64 omp_max_vf (bool);
 extern int omp_max_simt_vf (void);
 extern const char *omp_context_name_list_prop (tree);
 extern void omp_construct_traits_to_codes (tree, int, enum tree_code *);
diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index 44c4310075b..70a2c108fbc 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -4589,7 +4589,8 @@ lower_rec_simd_input_clauses (tree new_var, omp_context *ctx,
 {
   if (known_eq (sctx->max_vf, 0U))
     {
-      sctx->max_vf = sctx->is_simt ? omp_max_simt_vf () : omp_max_vf ();
+      sctx->max_vf = (sctx->is_simt ? omp_max_simt_vf ()
+		      : omp_max_vf (omp_maybe_offloaded_ctx (ctx)));
       if (maybe_gt (sctx->max_vf, 1U))
 	{
 	  tree c = omp_find_clause (gimple_omp_for_clauses (ctx->stmt),
-- 
2.46.0

From 9799c5153e10fee8f2a7c493d1faf74b31420d5f Mon Sep 17 00:00:00 2001
From: Andrew Stubbs <a...@baylibre.com>
Date: Fri, 1 Nov 2024 13:53:34 +0000
Subject: [PATCH 2/3] openmp: use offload max_vf for chunk_size

The chunk size for SIMD loops should be right for the current device; too big
allocates too much memory, too small is inefficient.  Getting it wrong doesn't
actually break anything though.

This patch attempts to choose the optimal setting based on the context.  Both
host-fallback and device will get the same chunk size, but device performance
is the most important in this case.

gcc/ChangeLog:

	* omp-expand.cc (is_in_offload_region): New function.
	(omp_adjust_chunk_size): Add pass-through "offload" parameter.
	(get_ws_args_for): Likewise.
	(determine_parallel_type): Use is_in_offload_region to adjust call to
	get_ws_args_for.
	(expand_omp_for_generic): Likewise.
	(expand_omp_for_static_chunk): Likewise.
---
 gcc/omp-expand.cc | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/gcc/omp-expand.cc b/gcc/omp-expand.cc
index 907fd46a5b2..b0f9d375b6c 100644
--- a/gcc/omp-expand.cc
+++ b/gcc/omp-expand.cc
@@ -127,6 +127,23 @@ is_combined_parallel (struct omp_region *region)
   return region->is_combined_parallel;
 }
 
+/* Return true if REGION is or is contained within an offload region.  */
+
+static bool
+is_in_offload_region (struct omp_region *region)
+{
+  gimple *entry_stmt = last_nondebug_stmt (region->entry);
+  if (is_gimple_omp (entry_stmt)
+      && is_gimple_omp_offloaded (entry_stmt))
+    return true;
+  else if (region->outer)
+    return is_in_offload_region (region->outer);
+  else
+    return (lookup_attribute ("omp declare target",
+			      DECL_ATTRIBUTES (current_function_decl))
+	    != NULL);
+}
+
 /* Given two blocks PAR_ENTRY_BB and WS_ENTRY_BB such that WS_ENTRY_BB
    is the immediate dominator of PAR_ENTRY_BB, return true if there
    are no data dependencies that would prevent expanding the parallel
@@ -207,12 +224,12 @@ workshare_safe_to_combine_p (basic_block ws_entry_bb)
    presence (SIMD_SCHEDULE).  */
 
 static tree
-omp_adjust_chunk_size (tree chunk_size, bool simd_schedule)
+omp_adjust_chunk_size (tree chunk_size, bool simd_schedule, bool offload)
 {
   if (!simd_schedule || integer_zerop (chunk_size))
     return chunk_size;
 
-  poly_uint64 vf = omp_max_vf (false);
+  poly_uint64 vf = omp_max_vf (offload);
   if (known_eq (vf, 1U))
     return chunk_size;
 
@@ -228,7 +245,7 @@ omp_adjust_chunk_size (tree chunk_size, bool simd_schedule)
    expanded.  */
 
 static vec<tree, va_gc> *
-get_ws_args_for (gimple *par_stmt, gimple *ws_stmt)
+get_ws_args_for (gimple *par_stmt, gimple *ws_stmt, bool offload)
 {
   tree t;
   location_t loc = gimple_location (ws_stmt);
@@ -270,7 +287,7 @@ get_ws_args_for (gimple *par_stmt, gimple *ws_stmt)
       if (fd.chunk_size)
 	{
 	  t = fold_convert_loc (loc, long_integer_type_node, fd.chunk_size);
-	  t = omp_adjust_chunk_size (t, fd.simd_schedule);
+	  t = omp_adjust_chunk_size (t, fd.simd_schedule, offload);
 	  ws_args->quick_push (t);
 	}
 
@@ -366,7 +383,8 @@ determine_parallel_type (struct omp_region *region)
 
       region->is_combined_parallel = true;
       region->inner->is_combined_parallel = true;
-      region->ws_args = get_ws_args_for (par_stmt, ws_stmt);
+      region->ws_args = get_ws_args_for (par_stmt, ws_stmt,
+					 is_in_offload_region (region));
     }
 }
 
@@ -3929,6 +3947,7 @@ expand_omp_for_generic (struct omp_region *region,
   tree *counts = NULL;
   int i;
   bool ordered_lastprivate = false;
+  bool offload = is_in_offload_region (region);
 
   gcc_assert (!broken_loop || !in_combined_parallel);
   gcc_assert (fd->iter_type == long_integer_type_node
@@ -4196,7 +4215,7 @@ expand_omp_for_generic (struct omp_region *region,
 	  if (fd->chunk_size)
 	    {
 	      t = fold_convert (fd->iter_type, fd->chunk_size);
-	      t = omp_adjust_chunk_size (t, fd->simd_schedule);
+	      t = omp_adjust_chunk_size (t, fd->simd_schedule, offload);
 	      if (sched_arg)
 		{
 		  if (fd->ordered)
@@ -4240,7 +4259,7 @@ expand_omp_for_generic (struct omp_region *region,
 	    {
 	      tree bfn_decl = builtin_decl_explicit (start_fn);
 	      t = fold_convert (fd->iter_type, fd->chunk_size);
-	      t = omp_adjust_chunk_size (t, fd->simd_schedule);
+	      t = omp_adjust_chunk_size (t, fd->simd_schedule, offload);
 	      if (sched_arg)
 		t = build_call_expr (bfn_decl, 10, t5, t0, t1, t2, sched_arg,
 				     t, t3, t4, reductions, mem);
@@ -5937,7 +5956,8 @@ expand_omp_for_static_chunk (struct omp_region *region,
   step = force_gimple_operand_gsi (&gsi, fold_convert (itype, step),
 				   true, NULL_TREE, true, GSI_SAME_STMT);
   tree chunk_size = fold_convert (itype, fd->chunk_size);
-  chunk_size = omp_adjust_chunk_size (chunk_size, fd->simd_schedule);
+  chunk_size = omp_adjust_chunk_size (chunk_size, fd->simd_schedule,
+				      is_in_offload_region (region));
   chunk_size
     = force_gimple_operand_gsi (&gsi, chunk_size, true, NULL_TREE, true,
 				GSI_SAME_STMT);
-- 
2.46.0

From e09e01a6161a83cc1a07f49620831e3bb0d0a629 Mon Sep 17 00:00:00 2001
From: Andrew Stubbs <a...@baylibre.com>
Date: Fri, 1 Nov 2024 15:00:25 +0000
Subject: [PATCH 3/3] openmp: Add IFN_GOMP_MAX_VF

Delay the omp_max_vf call until after the host and device compilers have
diverged so that the max_vf value can be tuned exactly right for both variants.

This change means that the ompdevlow pass must be enabled for functions that
use OpenMP directives combining a "schedule" clause with the "simd" modifier.

gcc/ChangeLog:

	* internal-fn.cc (expand_GOMP_MAX_VF): New function.
	* internal-fn.def (GOMP_MAX_VF): New internal function.
	* omp-expand.cc (omp_adjust_chunk_size): Emit IFN_GOMP_MAX_VF when
	called in offload context, otherwise assume host context.
	* omp-offload.cc (execute_omp_device_lower): Expand IFN_GOMP_MAX_VF.
---
 gcc/internal-fn.cc  |  8 ++++++++
 gcc/internal-fn.def |  1 +
 gcc/omp-expand.cc   | 10 +++++++++-
 gcc/omp-offload.cc  |  3 +++
 4 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 1b3fe7be047..0ee5f5bc7c5 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -510,6 +510,14 @@ expand_GOMP_SIMT_VF (internal_fn, gcall *)
 
 /* This should get expanded in omp_device_lower pass.  */
 
+static void
+expand_GOMP_MAX_VF (internal_fn, gcall *)
+{
+  gcc_unreachable ();
+}
+
+/* This should get expanded in omp_device_lower pass.  */
+
 static void
 expand_GOMP_TARGET_REV (internal_fn, gcall *)
 {
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 2d455938271..c3d0efc0f2c 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -465,6 +465,7 @@ DEF_INTERNAL_FN (GOMP_SIMT_ENTER_ALLOC, ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMT_EXIT, ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMT_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMT_VF, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOMP_MAX_VF, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMT_LAST_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMT_ORDERED_PRED, ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMT_VOTE_ANY, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
diff --git a/gcc/omp-expand.cc b/gcc/omp-expand.cc
index b0f9d375b6c..0f31108c156 100644
--- a/gcc/omp-expand.cc
+++ b/gcc/omp-expand.cc
@@ -229,7 +229,15 @@ omp_adjust_chunk_size (tree chunk_size, bool simd_schedule, bool offload)
   if (!simd_schedule || integer_zerop (chunk_size))
     return chunk_size;
 
-  poly_uint64 vf = omp_max_vf (offload);
+  if (offload)
+    {
+      cfun->curr_properties &= ~PROP_gimple_lomp_dev;
+      tree vf = build_call_expr_internal_loc (UNKNOWN_LOCATION, IFN_GOMP_MAX_VF,
+					      unsigned_type_node, 0);
+      return fold_convert (TREE_TYPE (chunk_size), vf);
+    }
+
+  poly_uint64 vf = omp_max_vf (false);
   if (known_eq (vf, 1U))
     return chunk_size;
 
diff --git a/gcc/omp-offload.cc b/gcc/omp-offload.cc
index 25ce8133fe5..372b019f9d6 100644
--- a/gcc/omp-offload.cc
+++ b/gcc/omp-offload.cc
@@ -2754,6 +2754,9 @@ execute_omp_device_lower ()
 	  case IFN_GOMP_SIMT_VF:
 	    rhs = build_int_cst (type, vf);
 	    break;
+	  case IFN_GOMP_MAX_VF:
+	    rhs = build_int_cst (type, omp_max_vf (false));
+	    break;
 	  case IFN_GOMP_SIMT_ORDERED_PRED:
 	    rhs = vf == 1 ? integer_zero_node : NULL_TREE;
 	    if (rhs || !lhs)
-- 
2.46.0
