> -----Original Message-----
> From: Gcc <gcc-bounces~prathameshk=nvidia....@gcc.gnu.org> On Behalf
> Of Prathamesh Kulkarni via Gcc
> Sent: 21 January 2025 17:05
> To: Jakub Jelinek <ja...@redhat.com>
> Cc: Andrew Stubbs <a...@baylibre.com>; Richard Biener
> <richard.guent...@gmail.com>; Richard Biener <rguent...@suse.de>;
> gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
> Subject: RE: [RFC] Enabling SVE with offloading to nvptx
> 
> External email: Use caution opening links or attachments
> 
> 
> > -----Original Message-----
> > From: Prathamesh Kulkarni <prathame...@nvidia.com>
> > Sent: 08 January 2025 15:22
> > To: Prathamesh Kulkarni <prathame...@nvidia.com>; Jakub Jelinek
> > <ja...@redhat.com>
> > Cc: Andrew Stubbs <a...@baylibre.com>; Richard Biener
> > <richard.guent...@gmail.com>; Richard Biener <rguent...@suse.de>;
> > gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
> > Subject: RE: [RFC] Enabling SVE with offloading to nvptx
> >
> >
> >
> > > -----Original Message-----
> > > From: Gcc <gcc-bounces~prathameshk=nvidia....@gcc.gnu.org> On Behalf
> > > Of Prathamesh Kulkarni via Gcc
> > > Sent: 27 December 2024 18:00
> > > To: Jakub Jelinek <ja...@redhat.com>
> > > Cc: Andrew Stubbs <a...@baylibre.com>; Richard Biener
> > > <richard.guent...@gmail.com>; Richard Biener <rguent...@suse.de>;
> > > gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
> > > Subject: RE: [RFC] Enabling SVE with offloading to nvptx
> > >
> > > > -----Original Message-----
> > > > From: Jakub Jelinek <ja...@redhat.com>
> > > > Sent: 17 December 2024 19:09
> > > > To: Prathamesh Kulkarni <prathame...@nvidia.com>
> > > > Cc: Andrew Stubbs <a...@baylibre.com>; Richard Biener
> > > > <richard.guent...@gmail.com>; Richard Biener <rguent...@suse.de>;
> > > > gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
> > > > Subject: Re: [RFC] Enabling SVE with offloading to nvptx
> > > >
> > > > On Mon, Dec 02, 2024 at 11:17:08AM +0000, Prathamesh Kulkarni wrote:
> > > > > --- a/gcc/cfgloop.h
> > > > > +++ b/gcc/cfgloop.h
> > > > > @@ -233,6 +233,12 @@ public:
> > > > >       flag_finite_loops or similar pragmas state.  */
> > > > >    unsigned finite_p : 1;
> > > > >
> > > > > +  /* True if SIMD loop needs delayed lowering of artefacts like
> > > > > +     safelen and length of omp simd arrays that depend on target's
> > > > > +     max_vf.  This is true for offloading, when max_vf is computed after
> > > > > +     streaming out to device.  */
> > > > > +  unsigned needs_max_vf_lowering: 1;
> > > >
> > > > Consistency, finite_p above uses space before :, the above line
> > > > doesn't.
> > > >
> > > > > --- a/gcc/omp-expand.cc
> > > > > +++ b/gcc/omp-expand.cc
> > > > > @@ -7170,6 +7170,10 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
> > > > >        loop->latch = cont_bb;
> > > > >        add_loop (loop, l1_bb->loop_father);
> > > > >        loop->safelen = safelen_int;
> > > > > +      loop->needs_max_vf_lowering = is_in_offload_region (region);
> > > > > +      if (loop->needs_max_vf_lowering)
> > > > > +     cfun->curr_properties &= ~PROP_gimple_lomp_dev;
> > > >
> > > > Do you really need this for non-SVE arches?
> > > > I mean, could you not set loop->needs_max_vf_lowering if maximum
> > > > number of poly_int coeffs is 1?  Or if omp_max_vf returns constant or
> > > > something similar?
> > > Well, I guess the issue is not really about VLA vectors but about when
> > > host and device have different max_vf, and selecting the optimal max_vf
> > > is not really possible during omp-low/omp-expand, since we don't have
> > > the device's target info available at this point. Andrew's recent patch
> > > works around this limitation by searching for "amdgcn" in
> > > OFFLOAD_TARGET_NAMES in omp_max_vf, but I guess a more general solution
> > > would be to delay lowering of max_vf until after streaming-out to the
> > > device, irrespective of VLA/VLS vectors?
> > > For AArch64/nvptx offloading with SVE, where the host is VLA and the
> > > device is VLS, the issue is more pronounced (failing to compile),
> > > compared to offloading from a VLS host to a VLS device (selecting a
> > > sub-optimal max_vf).
> > > >
> > > > > --- a/gcc/omp-offload.cc
> > > > > +++ b/gcc/omp-offload.cc
> > > > > @@ -2617,6 +2617,77 @@ find_simtpriv_var_op (tree *tp, int *walk_subtrees, void *)
> > > > >    return NULL_TREE;
> > > > >  }
> > > > >
> > > > > +/* Compute max_vf for target, and accordingly set loop->safelen and length
> > > > > +   of omp simd arrays.  */
> > > > > +
> > > > > +static void
> > > > > +adjust_max_vf (function *fun)
> > > > > +{
> > > > > +  if (!fun->has_simduid_loops)
> > > > > +    return;
> > > > > +
> > > > > +  poly_uint64 max_vf = omp_max_vf (false);
> > > > > +
> > > > > +  /* Since loop->safelen has to be an integer, it's not always possible
> > > > > +     to compare against poly_int.  For eg 32 and 16+16x are not comparable at
> > > > > +     compile-time because 16+16x <= 32 for x < 2, but 16+16x > 32 for x >= 2.
> > > > > +     Even if we could get runtime VL based on -mcpu/-march, that would not be
> > > > > +     portable across other SVE archs.
> > > > > +
> > > > > +     For now, use constant_lower_bound (max_vf), as a "safer approximation" to
> > > > > +     max_vf that avoids these issues, with the downside that it will be
> > > > > +     suboptimal max_vf for SVE archs implementing SIMD width > 128 bits.  */
> > > > > +
> > > > > +  uint64_t max_vf_int;
> > > > > +  if (!max_vf.is_constant (&max_vf_int))
> > > > > +    max_vf_int = constant_lower_bound (max_vf);
> > > > > +
> > > > > +  calculate_dominance_info (CDI_DOMINATORS);
> > > > > +  for (auto loop: loops_list (fun, 0))
> > > > > +    {
> > > > > +      if (!loop->needs_max_vf_lowering)
> > > > > +     continue;
> > > > > +
> > > > > +      if (loop->safelen > max_vf_int)
> > > > > +     loop->safelen = max_vf_int;
> > > > > +
> > > > > +      basic_block *bbs = get_loop_body (loop);
> > > >
> > > > I still think using the tree-vectorizer.cc infrastructure is much
> > > > better here.
> > > > There is no guarantee all accesses to the simd arrays will be within
> > > > the loop body, in fact, none of them could be there.  Consider e.g.
> > > > parts of loop body (in the C meaning) followed by noreturn calls,
> > > > those aren't considered loop body in the cfg.
> > > > So, I think it is much better to walk the whole function once, not for
> > > > each loop walk its loop body (that could be even more expensive if
> > > > there are nested needs_max_vf_lowering loops).
> > > > And note_simd_array_uses has it all implemented.
> > > > Building a mapping between
> > > > So, if you drop above
> > > >   calculate_dominance_info (CDI_DOMINATORS);
> > > > and when seeing first loop->needs_max_vf_lowering loop perform
> > > > note_simd_array_uses (just once per function), for the simduid -> loop
> > > > mapping you actually only care which simduid corresponds to
> > > > loop->needs_max_vf_lowering loop and which doesn't, so say hash set
> > > > indexed by simduid would be good enough.
> > > Thanks for the suggestions. The attached patch uses
> > > note_simd_array_uses to build the omp simd array -> simduid mapping,
> > > creates a new mapping from simduid -> safelen, and uses it to adjust
> > > the length of each omp simd array.
> > > Does it look OK?
> > >
> > > While working on the patch, I stumbled on an issue with
> > > note_simd_array_uses, which I have filed as PR118200, and posted a
> > > patch for it:
> > > https://gcc.gnu.org/pipermail/gcc-patches/2024-December/672269.html
> > >
> > > The patch (combined with the PR118200 patch) passes libgomp testing
> > > with AArch64/nvptx offloading (with/without SVE) and bootstrap+test on
> > > aarch64-linux-gnu.
> > Hi, ping: https://gcc.gnu.org/pipermail/gcc/2024-December/245284.html
> Hi, ping * 2: https://gcc.gnu.org/pipermail/gcc/2024-December/245284.html
Hi,
ping * 3: https://gcc.gnu.org/pipermail/gcc/2024-December/245284.html

Thanks,
Prathamesh
> 
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Prathamesh
> > > >
> > > >
> > > > > +      for (unsigned i = 0; i < loop->num_nodes; i++)
> > > > > +        for (auto gsi = gsi_start_bb (bbs[i]); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > +          {
> > > > > +            gcall *stmt = dyn_cast<gcall *> (gsi_stmt (gsi));
> > > > > +            tree lhs = NULL_TREE;
> > > > > +
> > > > > +            if (stmt
> > > > > +                && gimple_call_internal_p (stmt)
> > > > > +                && (gimple_call_internal_fn (stmt) == IFN_GOMP_SIMD_LANE)
> > > > > +                && ((lhs = gimple_call_lhs (stmt)) != NULL_TREE))
> > > > > +              {
> > > > > +                imm_use_iterator use_iter;
> > > > > +                gimple *use_stmt;
> > > > > +
> > > > > +                FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, lhs)
> > > > > +                  {
> > > > > +                    gassign *assign_stmt = dyn_cast<gassign *> (use_stmt);
> > > > > +                    if (!assign_stmt)
> > > > > +                      continue;
> > > > > +
> > > > > +                    tree assign_stmt_lhs = gimple_assign_lhs (assign_stmt);
> > > > > +                    if (TREE_CODE (assign_stmt_lhs) == ARRAY_REF)
> > > > > +                      {
> > > > > +                        tree array = TREE_OPERAND (assign_stmt_lhs, 0);
> > > > > +                        tree elt_type = TREE_TYPE (TREE_TYPE (array));
> > > > > +                        tree atype = build_array_type_nelts (elt_type, loop->safelen);
> > > > > +                        TREE_TYPE (array) = atype;
> > > > > +                        relayout_decl (array);
> > > >
> > > > Apart from the possibility that the .GOMP_SIMD_LANE call will end up
> > > > not in the loop body, this also has the disadvantage that if there
> > > > are say 1000 accesses to some simd array, you change the VAR_DECL
> > > > (new array type, relayout_decl) 1000 times.
> > > > By just traversing the hash table, each one that needs to be modified
> > > > will be touched just once.
> > > >
> > > >         Jakub
