On Tue, Jun 10, 2025 at 3:47 AM Richard Biener
<richard.guent...@gmail.com> wrote:
>
> On Tue, Jun 10, 2025 at 2:02 AM Andrew Pinski <pins...@gmail.com> wrote:
> >
> > On Mon, Jun 9, 2025 at 2:49 AM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > > On Sun, Jun 8, 2025 at 7:52 PM Andrew Pinski <quic_apin...@quicinc.com> 
> > > wrote:
> > > >
> > > > While thinking about how to implement the rest of the copy prop and 
> > > > makes sure not
> > > > to introduce some compile time problems, optimize_agr_copyprop should 
> > > > be changed
> > > > into a forwproping rather than looking backwards.
> > >
> > > Can you explain the issues when doing the backwards looking?
> >
> > There is no direct problems with backwards looking. The problem comes
> > if we want to say prop into a function call the original of the copy
> > say:
> > ```
> > a = b;
> > f(a);
> > ```
> > Backwards looking means you need to look back for all aggregate
> > function call arguments. So you waste time always looking back.
>
> Ah, OK, that makes sense.
>
> > Instead of looking forwards only when there is a copy.
> > we could also even do some proping like for:
> > ```
> > a = LC0;
> > _t = a[i];
> > ```
> >
> > Which sometimes shows up. Here it does not matter if we load from LC0
> > or a for _t for most cores; though there might be a store forward
> > penalty in some cases. Doing this prop even without fully removing the
> > copy is better in the end.
> >
> > >
> > > Note since this is a pure local "propagation" I still wonder whether
> > > (now in the forward direction)
> > > we want to use single_imm_use () instead of iterating over all uses of
> > > the VDEF.  Basically
> > > we want to keep the optimization as "local" as possible.
> >
> > So for vops, single_imm_use is not true if we have say:
> > ```
> > t = a;
> > _2 = *b;
> > c = t;
> > ```
> >
> > or:
> > ```
> > t = a;
> > if (c) goto BB2;
> > else BB3;
> > BB2:
> >   c = t;
> > ...
> >
> > BB3:
> >   c = t;
> > ```
>
> Yes.  That would be deliberate, similar to the situation below.  I've sort-of
> agreed with doing the "propagation" in forwprop because we are tackling
> the copy sequences caused by C++ temporaries and early inlining, those
> cases are heuristically sound to match.

I will note LLVM does exactly as I mentioned as part of their
"MemCpyOptimizer/MemCpyOptPass" pass. LLVM represents a struct copy as
a memcpy (with aliasing info) and uses `alloca` to allocate the space
and only later on handle the lowering.
Note LLVM runs the MemCpyOptPass pass after SRA; I wonder if their SRA
implementation is not aggressive as GCC's is. They also don't run
MemCpyOptPass pre-inlining (I am not sure LLVM have an early inliner
either).

I didn't feel like we wanted/needed a new pass to do this as forwprop
is already run a few times and does some of it already
(simplify_builtin_call). This is why I thought it would fit into
forwprop also.

>
> >
> > >
> > > I also wondered about
> > >
> > > b = a;
> > > c = b;
> > > d = b;
> > >
> > > we'll transform c = b into c = a and that's it.
> > >
> > > In the end I still hope we can leverage more global analysis, like
> > > that of SRA, to decide
> > > whether propagation is profitable or not, thus whether we can elide a
> > > temporary.  In
> > > that regard, shouldn't we avoid the propagation if 'DEST' has its
> > > address taken given
> > > we certainly cannot elide it?
> >
> > I am not even sure it matters on avoiding the propagation because for
> > large aggregates you will be getting an memcpy and it might be faster
> > not from the values which were just stored to. In the case of small
> > aggregates, the memcpy will be inlined using load/stores and the RTL
> > level will get rid of the loads that just happen and you end up with
> > one set of loads followed by 2 sets of stores. We are also removing
> > some of the store forward penalty in many cases even if we can't elide
> > the copy.
>
> I think the important bit is to elide temporaries.  I don't know if
>
>  b = a;
>  c = b;
>
> vs.
>
>  b = a;
>  c = a;
>
> makes a difference with regard to store forward penalties, possibly
> for small sizes, but then I'd expect matching store/loads and thus
> forwarding.  For large sizes there is cache effects to consider, I
> would expect 'a' to be evicted earlier than 'b' since 'b' needs to wait
> for write-back.  Of course if it's a memcpy the implementation might
> use non-temporal stores for large sizes.  That said, do we have any
> actual data here?

I don't. When I come back from vacation, I will try to gather some
data for this. Including some bigger C++ programs. Maybe even LLVM.

Thanks,
Andrew


>
> Richard.
>
> >
> > Thanks,
> > Andrew
> >
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > Bootstrapped and tested on x86_64-linux-gnu.
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * tree-ssa-forwprop.cc (optimize_agr_copyprop): Change into a
> > > >         forward looking (looking at vdef's uses) instead of a back
> > > >         looking (vuse's def).
> > > >
> > > > Signed-off-by: Andrew Pinski <quic_apin...@quicinc.com>
> > > > ---
> > > >  gcc/tree-ssa-forwprop.cc | 121 +++++++++++++++++++--------------------
> > > >  1 file changed, 60 insertions(+), 61 deletions(-)
> > > >
> > > > diff --git a/gcc/tree-ssa-forwprop.cc b/gcc/tree-ssa-forwprop.cc
> > > > index 43b1c9d696f..2dc77ccba1d 100644
> > > > --- a/gcc/tree-ssa-forwprop.cc
> > > > +++ b/gcc/tree-ssa-forwprop.cc
> > > > @@ -1344,12 +1344,12 @@ optimize_memcpy_to_memset (gimple_stmt_iterator 
> > > > *gsip, tree dest, tree src, tree
> > > >    return true;
> > > >  }
> > > >  /* Optimizes
> > > > -   a = c;
> > > > -   b = a;
> > > > +   DEST = SRC;
> > > > +   DEST2 = DEST; # DEST2 = SRC2;
> > > >     into
> > > > -   a = c;
> > > > -   b = c;
> > > > -   GSIP is the second statement and SRC is the common
> > > > +   DEST = SRC;
> > > > +   DEST2 = SRC;
> > > > +   GSIP is the first statement and SRC is the common
> > > >     between the statements.
> > > >  */
> > > >  static bool
> > > > @@ -1365,65 +1365,64 @@ optimize_agr_copyprop (gimple_stmt_iterator 
> > > > *gsip)
> > > >    if (operand_equal_p (dest, src, 0))
> > > >      return false;
> > > >
> > > > -  tree vuse = gimple_vuse (stmt);
> > > > -  /* If the vuse is the default definition, then there is no store 
> > > > beforehand.  */
> > > > -  if (SSA_NAME_IS_DEFAULT_DEF (vuse))
> > > > -    return false;
> > > > -  gimple *defstmt = SSA_NAME_DEF_STMT (vuse);
> > > > -  if (!gimple_assign_load_p (defstmt)
> > > > -      || !gimple_store_p (defstmt))
> > > > -    return false;
> > > > -  if (gimple_has_volatile_ops (defstmt))
> > > > -    return false;
> > > > -
> > > > -  tree dest2 = gimple_assign_lhs (defstmt);
> > > > -  tree src2 = gimple_assign_rhs1 (defstmt);
> > > > -
> > > > -  /* If the original store is `src2 = src2;` skip over it. */
> > > > -  if (operand_equal_p (src2, dest2, 0))
> > > > -    return false;
> > > > -  if (!operand_equal_p (src, dest2, 0))
> > > > -    return false;
> > > > -
> > > > -
> > > > -  /* For 2 memory refences and using a temporary to do the copy,
> > > > -     don't remove the temporary as the 2 memory references might 
> > > > overlap.
> > > > -     Note t does not need to be decl as it could be field.
> > > > -     See PR 22237 for full details.
> > > > -     E.g.
> > > > -     t = *a;
> > > > -     *b = t;
> > > > -     Cannot be convert into
> > > > -     t = *a;
> > > > -     *b = *a;
> > > > -     Though the following is allowed to be done:
> > > > -     t = *a;
> > > > -     *a = t;
> > > > -     And convert it into:
> > > > -     t = *a;
> > > > -     *a = *a;
> > > > -  */
> > > > -  if (!operand_equal_p (src2, dest, 0)
> > > > -      && !DECL_P (dest) && !DECL_P (src2))
> > > > -    return false;
> > > > -
> > > > -  if (dump_file && (dump_flags & TDF_DETAILS))
> > > > +  tree vdef = gimple_vdef (stmt);
> > > > +  imm_use_iterator iter;
> > > > +  gimple *use_stmt;
> > > > +  bool changed = false;
> > > > +  FOR_EACH_IMM_USE_STMT (use_stmt, iter, vdef)
> > > >      {
> > > > -      fprintf (dump_file, "Simplified\n  ");
> > > > -      print_gimple_stmt (dump_file, stmt, 0, dump_flags);
> > > > -      fprintf (dump_file, "after previous\n  ");
> > > > -      print_gimple_stmt (dump_file, defstmt, 0, dump_flags);
> > > > -    }
> > > > -  gimple_assign_set_rhs_from_tree (gsip, unshare_expr (src2));
> > > > -  update_stmt (stmt);
> > > > +      if (!gimple_assign_load_p (use_stmt)
> > > > +         || !gimple_store_p (use_stmt))
> > > > +       continue;
> > > > +      if (gimple_has_volatile_ops (use_stmt))
> > > > +       continue;
> > > > +      tree dest2 = gimple_assign_lhs (use_stmt);
> > > > +      tree src2 = gimple_assign_rhs1 (use_stmt);
> > > > +      /* If the new store is `src2 = src2;` skip over it. */
> > > > +      if (operand_equal_p (src2, dest2, 0))
> > > > +       continue;
> > > > +      if (!operand_equal_p (dest, src2, 0))
> > > > +       continue;
> > > > +      /* For 2 memory refences and using a temporary to do the copy,
> > > > +        don't remove the temporary as the 2 memory references might 
> > > > overlap.
> > > > +        Note t does not need to be decl as it could be field.
> > > > +        See PR 22237 for full details.
> > > > +        E.g.
> > > > +        t = *a; #DEST = SRC;
> > > > +        *b = t; #DEST2 = SRC2;
> > > > +        Cannot be convert into
> > > > +        t = *a;
> > > > +        *b = *a;
> > > > +        Though the following is allowed to be done:
> > > > +        t = *a;
> > > > +        *a = t;
> > > > +        And convert it into:
> > > > +        t = *a;
> > > > +        *a = *a;
> > > > +       */
> > > > +      if (!operand_equal_p (dest2, src, 0)
> > > > +         && !DECL_P (dest2) && !DECL_P (src))
> > > > +       continue;
> > > > +      if (dump_file && (dump_flags & TDF_DETAILS))
> > > > +       {
> > > > +         fprintf (dump_file, "Simplified\n  ");
> > > > +         print_gimple_stmt (dump_file, use_stmt, 0, dump_flags);
> > > > +         fprintf (dump_file, "after previous\n  ");
> > > > +         print_gimple_stmt (dump_file, stmt, 0, dump_flags);
> > > > +       }
> > > > +      gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);
> > > > +      gimple_assign_set_rhs_from_tree (&gsi, unshare_expr (src));
> > > > +      update_stmt (use_stmt);
> > > >
> > > > -  if (dump_file && (dump_flags & TDF_DETAILS))
> > > > -    {
> > > > -      fprintf (dump_file, "into\n  ");
> > > > -      print_gimple_stmt (dump_file, stmt, 0, dump_flags);
> > > > +      if (dump_file && (dump_flags & TDF_DETAILS))
> > > > +       {
> > > > +         fprintf (dump_file, "into\n  ");
> > > > +         print_gimple_stmt (dump_file, use_stmt, 0, dump_flags);
> > > > +       }
> > > > +      statistics_counter_event (cfun, "copy prop for aggregate", 1);
> > > > +      changed = true;
> > > >      }
> > > > -  statistics_counter_event (cfun, "copy prop for aggregate", 1);
> > > > -  return true;
> > > > +  return changed;
> > > >  }
> > > >
> > > >  /* *GSI_P is a GIMPLE_CALL to a builtin function.
> > > > --
> > > > 2.43.0
> > > >

Reply via email to