> -----Original Message-----
> From: Andrew Stubbs <a...@baylibre.com>
> Sent: 12 November 2024 20:23
> To: Prathamesh Kulkarni <prathame...@nvidia.com>; Jakub Jelinek <ja...@redhat.com>
> Cc: Richard Biener <richard.guent...@gmail.com>; Richard Biener <rguent...@suse.de>; gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
> Subject: Re: [RFC] Enabling SVE with offloading to nvptx
>
> External email: Use caution opening links or attachments
>
>
> On 12/11/2024 06:01, Prathamesh Kulkarni via Gcc wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jakub Jelinek <ja...@redhat.com>
> >> Sent: 04 November 2024 21:44
> >> To: Prathamesh Kulkarni <prathame...@nvidia.com>
> >> Cc: Richard Biener <richard.guent...@gmail.com>; Richard Biener <rguent...@suse.de>; gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
> >> Subject: Re: [RFC] Enabling SVE with offloading to nvptx
> >>
> >> External email: Use caution opening links or attachments
> >>
> >>
> >> On Sat, Nov 02, 2024 at 03:53:34PM +0000, Prathamesh Kulkarni wrote:
> >>> The attached patch adds a new bitfield needs_max_vf_lowering to loop,
> >>> and sets that in expand_omp_simd for loops that need delayed lowering
> >>> of safelen and omp simd arrays.  The patch defines a new macro
> >>> OMP_COMMON_MAX_VF (arbitrarily set to 16) as a placeholder value for
> >>> max_vf (instead of INT_MAX), which is later replaced by the appropriate
> >>> max_vf during the omp_adjust_max_vf pass.  Does that look OK?
> >>
> >> No.
> >> The thing is, if the user doesn't specify safelen, it defaults to
> >> infinity (which we represent as INT_MAX); if the user specifies it, then
> >> that is the maximum for it (currently in the OpenMP specification it is
> >> just an integral value, so it can't be a poly_int).
> >> The lowering then uses max_vf as another limit, i.e. what the hw can do
> >> at most, and sizes the magic arrays with it.  So one needs to use the
> >> minimum of what the user specified and what the hw can handle.
> >> Using 16 as some magic value is therefore just wrong: safelen(16) can be
> >> specified in the source as well, or safelen(8), or safelen(32), or
> >> safelen(123).
> >>
> >> Thus, the fact that the hw maximum hasn't been determined yet needs
> >> to be represented in some other flag, not in the loop->safelen value;
> >> before it is determined, loop->safelen should represent what the user
> >> wrote (or what was implied), and the later pass should use the minimum
> >> of loop->safelen and the picked hw maximum.  Of course, if the picked
> >> hw maximum is POLY_INT-ish, the big question is how to compare that
> >> against the user-supplied integer value: either one can just handle the
> >> INT_MAX (aka infinity) special case, or say query the backend for the
> >> maximum runtime value of the POLY_INT and only use the POLY_INT if it
> >> is always known to be smaller than or equal to the user-supplied
> >> safelen.
> >>
> >> Another thing (already mentioned in the thread Andrew referenced) is
> >> that max_vf is used in two separate places.  One is just to size the
> >> magic arrays and to serve as one operand of the minimum (the other
> >> being the user-specified safelen).  In this case it is generally fine
> >> to pick a larger value than strictly necessary (as long as it is never
> >> larger than the user-supplied safelen).
> >> The other case is the simd modifier on the schedule clause.  That value
> >> should better be the right one or slightly larger, but not too much.
> >> I think currently we just use the INTEGER_CST we pick as the maximum;
> >> if this sizing is deferred, maybe it needs to be another internal
> >> function that asks for the value (though it can refer to a loop vf in
> >> another function, which complicates stuff).
> >>
> >> Regarding Richi's question, I'm afraid the OpenMP simd loop lowering
> >> can't be delayed until some later pass.
> >
> > Hi Jakub,
> > Thanks for the suggestions!  The attached patch makes the following
> > changes:
> >
> > (1) Delays setting of safelen for offloading by introducing a new
> > bitfield needs_max_vf_lowering in loop, which is true with offloading
> > enabled; safelen is then set to min(safelen, max_vf) for the target
> > later, in the omp_device_lower pass.
> > Comparing a user-specified safelen with a poly_int max_vf may not
> > always be possible at compile time (say 32 and 16+16x), and even if we
> > determined the runtime VL based on -mcpu flags, I guess relying on that
> > wouldn't be portable?
> > The patch works around this by taking constant_lower_bound (max_vf) and
> > comparing that with safelen instead, with the downside that
> > constant_lower_bound (max_vf) will not be the optimal max_vf for an SVE
> > target that implements a SIMD width > 128 bits.
> >
> > (2) Since max_vf is used as the length of the omp simd array, it gets
> > streamed out to the device, and device compilation fails during
> > streaming-in if max_vf is a poly_int (16+16x) and the device's
> > NUM_POLY_INT_COEFFS < 2 (which is what motivated my patch).  The patch
> > tries to address this by simply setting the length to a placeholder
> > value (INT_MAX?) in lower_rec_simd_input_clauses if offloading is
> > enabled; it is later set to the appropriate value in the
> > omp_device_lower pass.
> >
> > (3) Andrew's patches seem to already fix the case of adjusting
> > chunk_size for the schedule clause with the simd modifier, by
> > introducing a new internal function .GOMP_MAX_VF which is then replaced
> > by the target's max_vf.  To keep it consistent with safelen, the patch
> > here uses constant_lower_bound (max_vf) there too.
> >
> > The patch passes libgomp testing for AArch64/nvptx offloading (with and
> > without GPU).
> > Does it look OK?
>
> I've not reviewed the patch in detail, but I can confirm that this does
> not break my usecase or cause any test regressions, for me.

Hi Andrew,
Thanks for testing the patch!

> However, are you sure that ompdevlow is always running?  I think you
> need to add this, somewhere:
>
>    cfun->curr_properties &= ~PROP_gimple_lomp_dev;
>
> My patch added this into omp_adjust_chunk_size, which triggers it for
> certain schedule clauses, but I think you want it whenever safelen is
> needed?

Ah indeed, thanks for the suggestions.  I have tried to fix it in the
attached patch.
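
To illustrate the comparison problem in (1) outside of GCC, here is a small
standalone sketch; poly2 below is just a stand-in for a two-coefficient
poly_uint64 such as 16+16x, not the real poly-int.h API:

/* Standalone model of the problem: a value c0 + c1*x, where x >= 0
   (the number of extra 128-bit SVE blocks) is only known at runtime.  */
#include <cstdint>
#include <cstdio>
#include <algorithm>

struct poly2 { uint64_t c0, c1; };   /* value = c0 + c1 * x */

/* True iff c0 + c1*x <= bound for every runtime x >= 0.  */
static bool known_le (poly2 p, uint64_t bound)
{
  return p.c1 == 0 && p.c0 <= bound;
}

/* The only bound that holds for all x is the value at x = 0.  */
static uint64_t lower_bound (poly2 p) { return p.c0; }

int main ()
{
  poly2 max_vf = { 16, 16 };   /* e.g. SVE: 16 + 16x */
  uint64_t safelen = 32;       /* user wrote safelen(32) */

  /* 16 + 16x <= 32 for x < 2 but > 32 for x >= 2, so neither ordering
     can be decided at compile time.  */
  std::printf ("known_le (16+16x, 32) = %d\n", known_le (max_vf, safelen));

  /* What the patch falls back to: clamp safelen by the constant lower
     bound of max_vf; always safe, but suboptimal for wider SVE.  */
  uint64_t clamped = std::min (safelen, lower_bound (max_vf));
  std::printf ("clamped safelen = %llu\n", (unsigned long long) clamped);
  return 0;
}

Only the x = 0 coefficient is guaranteed, which is why the patch clamps
loop->safelen (and the omp simd array length) with constant_lower_bound (max_vf).
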
Thanks,
Prathamesh

>
> Andrew
>
> > Thanks,
> > Prathamesh
> >>
> >> Jakub
> >
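
For reference, a minimal offloaded simd loop of the kind the patch is
concerned with could look like the one below (a hypothetical reproducer, not
taken from the testsuite); the lastprivate scalar is the sort of variable
that gets backed by an omp simd array whose length comes from max_vf:

/* Hypothetical reproducer: a simd loop inside a target region.  The
   lastprivate scalar "t" is implemented via an omp simd array whose
   length is max_vf, which is a poly_int on SVE hosts and previously
   broke streaming to the nvptx device.  */
#include <stdio.h>

int main (void)
{
  int a[128], t = 0;

#pragma omp target map(tofrom: a, t)
#pragma omp simd safelen(32) lastprivate(t)
  for (int i = 0; i < 128; i++)
    {
      t = i * 2;
      a[i] = t;
    }

  printf ("%d %d\n", a[127], t);   /* expect: 254 254 */
  return 0;
}
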
Delay lowering safelen if offloading is enabled.

gcc/ChangeLog:
	* cfgloop.h (loop): New bitfield needs_max_vf_lowering.
	* omp-expand.cc (expand_omp_simd): Set loop->needs_max_vf_lowering
	to result of is_in_offload_region.
	* omp-low.cc (lower_rec_simd_input_clauses): Set a placeholder
	value for length of omp simd array if offloading is enabled and
	max_vf is non-constant poly_int.
	* omp-offload.cc (adjust_max_vf): New function.
	(execute_omp_device_lower): Call adjust_max_vf, and use
	constant_lower_bound on result of omp_max_vf while replacing call
	to .GOMP_MAX_VF.

Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com>

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 30b5e40d0d9..f7b57e594fd 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -233,6 +233,12 @@ public:
      flag_finite_loops or similar pragmas state.  */
   unsigned finite_p : 1;
 
+  /* True if SIMD loop needs delayed lowering of artefacts like
+     safelen and length of omp simd arrays that depend on target's
+     max_vf.  This is true for offloading, when max_vf is computed after
+     streaming out to device.  */
+  unsigned needs_max_vf_lowering: 1;
+
   /* The number of times to unroll the loop.  0 means no information given,
      just do what we always do.  A value of 1 means do not unroll the loop.
      A value of USHRT_MAX means unroll with no specific unrolling factor.
diff --git a/gcc/omp-expand.cc b/gcc/omp-expand.cc
index 80fb1843445..c9581e3d92e 100644
--- a/gcc/omp-expand.cc
+++ b/gcc/omp-expand.cc
@@ -7171,6 +7171,10 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
       loop->latch = cont_bb;
       add_loop (loop, l1_bb->loop_father);
       loop->safelen = safelen_int;
+      loop->needs_max_vf_lowering = is_in_offload_region (region);
+      if (loop->needs_max_vf_lowering)
+        cfun->curr_properties &= ~PROP_gimple_lomp_dev;
+
       if (simduid)
         {
           loop->simduid = OMP_CLAUSE__SIMDUID__DECL (simduid);
diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index 70a2c108fbc..0f68876a2bc 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -4660,7 +4660,16 @@ lower_rec_simd_input_clauses (tree new_var, omp_context *ctx,
     }
   else
     {
-      tree atype = build_array_type_nelts (TREE_TYPE (new_var), sctx->max_vf);
+      poly_uint64 nelts;
+      /* FIXME: If offloading is enabled, and max_vf is poly_int, avoid
+         using it as length of omp simd array, and use a placeholder value
+         instead to avoid streaming out poly_int to device.  The arrays
+         will be set to correct length later in omp_device_lower pass.  */
+      if (omp_maybe_offloaded_ctx (ctx) && !sctx->max_vf.is_constant ())
+        nelts = INT_MAX;
+      else
+        nelts = sctx->max_vf;
+      tree atype = build_array_type_nelts (TREE_TYPE (new_var), nelts);
       tree avar = create_tmp_var_raw (atype);
       if (TREE_ADDRESSABLE (new_var))
         TREE_ADDRESSABLE (avar) = 1;
diff --git a/gcc/omp-offload.cc b/gcc/omp-offload.cc
index 372b019f9d6..9d6467cc6a6 100644
--- a/gcc/omp-offload.cc
+++ b/gcc/omp-offload.cc
@@ -2618,6 +2618,75 @@ find_simtpriv_var_op (tree *tp, int *walk_subtrees, void *)
   return NULL_TREE;
 }
 
+/* Compute max_vf for target, and accordingly set loop->safelen and length
+   of omp simd arrays.  */
+
+static void
+adjust_max_vf (function *fun)
+{
+  if (!fun->has_simduid_loops)
+    return;
+
+  poly_uint64 max_vf = omp_max_vf (false);
+
+  /* FIXME: Since loop->safelen has to be an integer, it's not always possible
+     to compare against poly_int.  For eg 32 and 16+16x are not comparable at
+     compile-time because 16+16x <= 32 for x < 2, but 16+16x > 32 for x >= 2.
+     Even if we could get runtime VL based on -mcpu/-march, that would not be
+     portable across other SVE archs.
+
+     For now, use constant_lower_bound (max_vf), as a "safer approximation" to
+     max_vf that avoids these issues, with the downside that it will be
+     suboptimal max_vf for SVE archs implementing SIMD width > 128 bits.  */
+
+  uint64_t max_vf_int;
+  if (!max_vf.is_constant (&max_vf_int))
+    max_vf_int = constant_lower_bound (max_vf);
+
+  calculate_dominance_info (CDI_DOMINATORS);
+  for (auto loop: loops_list (fun, 0))
+    {
+      if (!loop->needs_max_vf_lowering)
+        continue;
+
+      if (loop->safelen > max_vf_int)
+        loop->safelen = max_vf_int;
+
+      basic_block *bbs = get_loop_body (loop);
+      for (unsigned i = 0; i < loop->num_nodes; i++)
+        for (auto gsi = gsi_start_bb (bbs[i]); !gsi_end_p (gsi); gsi_next (&gsi))
+          {
+            gcall *stmt = dyn_cast<gcall *> (gsi_stmt (gsi));
+            tree lhs = NULL_TREE;
+
+            if (stmt
+                && gimple_call_internal_p (stmt)
+                && (gimple_call_internal_fn (stmt) == IFN_GOMP_SIMD_LANE)
+                && ((lhs = gimple_call_lhs (stmt)) != NULL_TREE))
+              {
+                imm_use_iterator use_iter;
+                gimple *use_stmt;
+
+                FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, lhs)
+                  {
+                    gassign *assign_stmt = dyn_cast<gassign *> (use_stmt);
+                    if (!assign_stmt)
+                      continue;
+
+                    tree assign_stmt_lhs = gimple_assign_lhs (assign_stmt);
+                    if (TREE_CODE (assign_stmt_lhs) == ARRAY_REF)
+                      {
+                        tree array = TREE_OPERAND (assign_stmt_lhs, 0);
+                        tree& max = TYPE_MAX_VALUE (TYPE_DOMAIN (TREE_TYPE (array)));
+                        max = size_int (loop->safelen - 1);
+                        relayout_decl (array);
+                      }
+                  }
+              }
+          }
+    }
+}
+
 /* Cleanup uses of SIMT placeholder internal functions: on non-SIMT targets,
    VF is 1 and LANE is 0; on SIMT targets, VF is folded to a constant, and
    LANE is kept to be expanded to RTL later on.  Also cleanup all other SIMT
@@ -2627,6 +2696,8 @@ find_simtpriv_var_op (tree *tp, int *walk_subtrees, void *)
 static unsigned int
 execute_omp_device_lower ()
 {
+  adjust_max_vf (cfun);
+
   int vf = targetm.simt.vf ? targetm.simt.vf () : 1;
   bool regimplify = false;
   basic_block bb;
@@ -2755,7 +2826,10 @@ execute_omp_device_lower ()
             rhs = build_int_cst (type, vf);
             break;
           case IFN_GOMP_MAX_VF:
-            rhs = build_int_cst (type, omp_max_vf (false));
+            /* Use constant_lower_bound, to keep max_vf consistent with
+               adjust_max_vf above.  */
+            rhs = build_int_cst (type,
+                                 constant_lower_bound (omp_max_vf (false)));
             break;
           case IFN_GOMP_SIMT_ORDERED_PRED:
             rhs = vf == 1 ? integer_zero_node : NULL_TREE;