On Tue, Jun 10, 2025 at 3:47 AM Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Tue, Jun 10, 2025 at 2:02 AM Andrew Pinski <pins...@gmail.com> wrote:
> >
> > On Mon, Jun 9, 2025 at 2:49 AM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > > On Sun, Jun 8, 2025 at 7:52 PM Andrew Pinski <quic_apin...@quicinc.com>
> > > wrote:
> > > >
> > > > While thinking about how to implement the rest of the copy prop and
> > > > make sure not to introduce compile-time problems, optimize_agr_copyprop
> > > > should be changed into forward propagation rather than looking
> > > > backwards.
> > >
> > > Can you explain the issues with the backwards-looking approach?
> >
> > There are no direct problems with looking backwards.  The problem comes
> > if we want to, say, propagate the original of the copy into a function
> > call argument, say:
> > ```
> > a = b;
> > f(a);
> > ```
> > Looking backwards means you need to look back from every aggregate
> > function call argument, so you waste time always looking back.
>
> Ah, OK, that makes sense.
>
> > Instead, we only look forwards when there is a copy.  We could also do
> > some propagation for cases like:
> > ```
> > a = LC0;
> > _t = a[i];
> > ```
> >
> > This sometimes shows up.  Here it does not matter for most cores whether
> > _t is loaded from LC0 or from a, though there might be a store-forwarding
> > penalty in some cases.  Doing this propagation even without fully
> > removing the copy is better in the end.
> >
> > >
> > > Note since this is a purely local "propagation" I still wonder whether
> > > (now in the forward direction) we want to use single_imm_use () instead
> > > of iterating over all uses of the VDEF.  Basically we want to keep the
> > > optimization as "local" as possible.
> >
> > So for vops, single_imm_use does not hold if we have, say:
> > ```
> > t = a;
> > _2 = *b;
> > c = t;
> > ```
> >
> > or:
> > ```
> > t = a;
> > if (c) goto BB2;
> > else BB3;
> > BB2:
> > c = t;
> > ...
> >
> > BB3:
> > c = t;
> > ```
>
> Yes.  That would be deliberate, similar to the situation below.  I've
> sort-of agreed with doing the "propagation" in forwprop because we are
> tackling the copy sequences caused by C++ temporaries and early inlining;
> those cases are heuristically sound to match.
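
For concreteness, the copy-into-call case I described above looks roughly
like this in C (the struct and function names here are made up purely for
illustration, they are not from the patch or its testsuite):

```
/* Made-up example only: an aggregate copy feeding a by-value call
   argument, the shape the forward walk is meant to catch cheaply.  */
struct S { int data[16]; };

void consume (struct S);	/* takes the aggregate by value */

void
use (struct S *p)
{
  struct S tmp = *p;	/* the aggregate copy: DEST = SRC */
  consume (tmp);	/* aggregate call argument using the copy; walking
			   forward from the copy's VDEF reaches this use
			   directly, while walking backwards would mean
			   checking every aggregate call argument for a
			   preceding copy.  */
}
```
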
I will note that LLVM does exactly what I mentioned as part of its
"MemCpyOptimizer"/MemCpyOptPass pass.  LLVM represents a struct copy as a
memcpy (with aliasing info), uses `alloca` to allocate the space, and only
handles the lowering later on.  Note LLVM runs MemCpyOptPass after SRA; I
wonder if their SRA implementation is not as aggressive as GCC's.  They
also don't run MemCpyOptPass pre-inlining (I am not sure LLVM has an early
inliner either).  I didn't feel like we wanted/needed a new pass to do
this, as forwprop already runs a few times and already does some of this
(simplify_builtin_call).  This is why I thought it would fit into forwprop
as well.

> > >
> > > I also wondered about
> > >
> > >  b = a;
> > >  c = b;
> > >  d = b;
> > >
> > > we'll transform c = b into c = a and that's it.
> > >
> > > In the end I still hope we can leverage more global analysis, like
> > > that of SRA, to decide whether propagation is profitable or not, thus
> > > whether we can elide a temporary.  In that regard, shouldn't we avoid
> > > the propagation if 'DEST' has its address taken, given we certainly
> > > cannot elide it?
> >
> > I am not even sure avoiding the propagation matters, because for large
> > aggregates you will be getting a memcpy and it might be faster not to
> > copy from the memory which was just stored to.  In the case of small
> > aggregates, the memcpy will be inlined using loads/stores, the RTL level
> > will get rid of the loads that just happened, and you end up with one
> > set of loads followed by two sets of stores.  We are also removing some
> > of the store-forwarding penalty in many cases even if we can't elide
> > the copy.
>
> I think the important bit is to elide temporaries.  I don't know if
>
>  b = a;
>  c = b;
>
> vs.
>
>  b = a;
>  c = a;
>
> makes a difference with regard to store-forwarding penalties, possibly
> for small sizes, but then I'd expect matching stores/loads and thus
> forwarding.  For large sizes there are cache effects to consider; I
> would expect 'a' to be evicted earlier than 'b' since 'b' needs to wait
> for write-back.  Of course if it's a memcpy the implementation might
> use non-temporal stores for large sizes.  That said, do we have any
> actual data here?

I don't.  When I come back from vacation, I will try to gather some data
for this, including from some bigger C++ programs, maybe even LLVM.

Thanks,
Andrew

>
> Richard.
>
> >
> > Thanks,
> > Andrew
> >
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > Bootstrapped and tested on x86_64-linux-gnu.
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * tree-ssa-forwprop.cc (optimize_agr_copyprop): Change into
> > > >         forward looking (looking at the vdef's uses) instead of
> > > >         backward looking (at the vuse's def).
> > > >
> > > > Signed-off-by: Andrew Pinski <quic_apin...@quicinc.com>
> > > > ---
> > > >  gcc/tree-ssa-forwprop.cc | 121 +++++++++++++++++++--------------------
> > > >  1 file changed, 60 insertions(+), 61 deletions(-)
> > > >
> > > > diff --git a/gcc/tree-ssa-forwprop.cc b/gcc/tree-ssa-forwprop.cc
> > > > index 43b1c9d696f..2dc77ccba1d 100644
> > > > --- a/gcc/tree-ssa-forwprop.cc
> > > > +++ b/gcc/tree-ssa-forwprop.cc
> > > > @@ -1344,12 +1344,12 @@ optimize_memcpy_to_memset (gimple_stmt_iterator *gsip, tree dest, tree src, tree
> > > >    return true;
> > > >  }
> > > >  /* Optimizes
> > > > -   a = c;
> > > > -   b = a;
> > > > +   DEST = SRC;
> > > > +   DEST2 = DEST; # DEST2 = SRC2;
> > > >     into
> > > > -   a = c;
> > > > -   b = c;
> > > > -   GSIP is the second statement and SRC is the common
> > > > +   DEST = SRC;
> > > > +   DEST2 = SRC;
> > > > +   GSIP is the first statement and SRC is the common
> > > >     between the statements.
> > > >   */
> > > >  static bool
> > > > @@ -1365,65 +1365,64 @@ optimize_agr_copyprop (gimple_stmt_iterator *gsip)
> > > >    if (operand_equal_p (dest, src, 0))
> > > >      return false;
> > > >
> > > > -  tree vuse = gimple_vuse (stmt);
> > > > -  /* If the vuse is the default definition, then there is no store beforehand. */
> > > > -  if (SSA_NAME_IS_DEFAULT_DEF (vuse))
> > > > -    return false;
> > > > -  gimple *defstmt = SSA_NAME_DEF_STMT (vuse);
> > > > -  if (!gimple_assign_load_p (defstmt)
> > > > -      || !gimple_store_p (defstmt))
> > > > -    return false;
> > > > -  if (gimple_has_volatile_ops (defstmt))
> > > > -    return false;
> > > > -
> > > > -  tree dest2 = gimple_assign_lhs (defstmt);
> > > > -  tree src2 = gimple_assign_rhs1 (defstmt);
> > > > -
> > > > -  /* If the original store is `src2 = src2;` skip over it. */
> > > > -  if (operand_equal_p (src2, dest2, 0))
> > > > -    return false;
> > > > -  if (!operand_equal_p (src, dest2, 0))
> > > > -    return false;
> > > > -
> > > > -
> > > > -  /* For 2 memory refences and using a temporary to do the copy,
> > > > -     don't remove the temporary as the 2 memory references might overlap.
> > > > -     Note t does not need to be decl as it could be field.
> > > > -     See PR 22237 for full details.
> > > > -     E.g.
> > > > -     t = *a;
> > > > -     *b = t;
> > > > -     Cannot be convert into
> > > > -     t = *a;
> > > > -     *b = *a;
> > > > -     Though the following is allowed to be done:
> > > > -     t = *a;
> > > > -     *a = t;
> > > > -     And convert it into:
> > > > -     t = *a;
> > > > -     *a = *a;
> > > > -   */
> > > > -  if (!operand_equal_p (src2, dest, 0)
> > > > -      && !DECL_P (dest) && !DECL_P (src2))
> > > > -    return false;
> > > > -
> > > > -  if (dump_file && (dump_flags & TDF_DETAILS))
> > > > +  tree vdef = gimple_vdef (stmt);
> > > > +  imm_use_iterator iter;
> > > > +  gimple *use_stmt;
> > > > +  bool changed = false;
> > > > +  FOR_EACH_IMM_USE_STMT (use_stmt, iter, vdef)
> > > >      {
> > > > -      fprintf (dump_file, "Simplified\n ");
> > > > -      print_gimple_stmt (dump_file, stmt, 0, dump_flags);
> > > > -      fprintf (dump_file, "after previous\n ");
> > > > -      print_gimple_stmt (dump_file, defstmt, 0, dump_flags);
> > > > -    }
> > > > -  gimple_assign_set_rhs_from_tree (gsip, unshare_expr (src2));
> > > > -  update_stmt (stmt);
> > > > +      if (!gimple_assign_load_p (use_stmt)
> > > > +          || !gimple_store_p (use_stmt))
> > > > +        continue;
> > > > +      if (gimple_has_volatile_ops (use_stmt))
> > > > +        continue;
> > > > +      tree dest2 = gimple_assign_lhs (use_stmt);
> > > > +      tree src2 = gimple_assign_rhs1 (use_stmt);
> > > > +      /* If the new store is `src2 = src2;` skip over it. */
> > > > +      if (operand_equal_p (src2, dest2, 0))
> > > > +        continue;
> > > > +      if (!operand_equal_p (dest, src2, 0))
> > > > +        continue;
> > > > +      /* For 2 memory references and using a temporary to do the copy,
> > > > +         don't remove the temporary as the 2 memory references might overlap.
> > > > +         Note t does not need to be a decl as it could be a field.
> > > > +         See PR 22237 for full details.
> > > > +         E.g.
> > > > +         t = *a; #DEST = SRC;
> > > > +         *b = t; #DEST2 = SRC2;
> > > > +         Cannot be converted into
> > > > +         t = *a;
> > > > +         *b = *a;
> > > > +         Though the following is allowed to be done:
> > > > +         t = *a;
> > > > +         *a = t;
> > > > +         And convert it into:
> > > > +         t = *a;
> > > > +         *a = *a;
> > > > +       */
> > > > +      if (!operand_equal_p (dest2, src, 0)
> > > > +          && !DECL_P (dest2) && !DECL_P (src))
> > > > +        continue;
> > > > +      if (dump_file && (dump_flags & TDF_DETAILS))
> > > > +        {
> > > > +          fprintf (dump_file, "Simplified\n ");
> > > > +          print_gimple_stmt (dump_file, use_stmt, 0, dump_flags);
> > > > +          fprintf (dump_file, "after previous\n ");
> > > > +          print_gimple_stmt (dump_file, stmt, 0, dump_flags);
> > > > +        }
> > > > +      gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);
> > > > +      gimple_assign_set_rhs_from_tree (&gsi, unshare_expr (src));
> > > > +      update_stmt (use_stmt);
> > > >
> > > > -  if (dump_file && (dump_flags & TDF_DETAILS))
> > > > -    {
> > > > -      fprintf (dump_file, "into\n ");
> > > > -      print_gimple_stmt (dump_file, stmt, 0, dump_flags);
> > > > +      if (dump_file && (dump_flags & TDF_DETAILS))
> > > > +        {
> > > > +          fprintf (dump_file, "into\n ");
> > > > +          print_gimple_stmt (dump_file, use_stmt, 0, dump_flags);
> > > > +        }
> > > > +      statistics_counter_event (cfun, "copy prop for aggregate", 1);
> > > > +      changed = true;
> > > >      }
> > > > -  statistics_counter_event (cfun, "copy prop for aggregate", 1);
> > > > -  return true;
> > > > +  return changed;
> > > >  }
> > > >
> > > >  /* *GSI_P is a GIMPLE_CALL to a builtin function.
> > > > --
> > > > 2.43.0
> > > >
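
P.S. For readers following along, the PR 22237 situation that the comment
above guards against looks roughly like the following C sketch (the names
are made up here; this is not the actual testcase from the PR):

```
/* Made-up sketch: why `t = *a; *b = t;' must keep the temporary when
   both sides are indirect references.  */
struct S { int x[4]; };

void
copy_via_tmp (struct S *a, struct S *b)
{
  struct S t = *a;	/* t is a local DECL, so it cannot overlap *a or *b */
  *b = t;		/* rewriting this as *b = *a; would be wrong if the
			   memory behind a and b overlaps, since the temporary
			   snapshots *a in full before *b is written */
}
```

This mirrors the comment kept in the patch above: the temporary only has to
stay when both references are indirect, which is what the DECL_P checks test.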