On Wed, 5 Oct 2011, William J. Schmidt wrote:

> This patch addresses the poor code generation in PR46556 for the
> following code:
>
> struct x
> {
>   int a[16];
>   int b[16];
>   int c[16];
> };
>
> extern void foo (int, int, int);
>
> void
> f (struct x *p, unsigned int n)
> {
>   foo (p->a[n], p->c[n], p->b[n]);
> }
>
> Prior to the fix for PR32698, gcc calculated the offset for accessing
> the array elements as: n*4; 64+n*4; 128+n*4.
>
> Following that fix, the offsets are calculated as: n*4; (n+16)*4;
> (n+32)*4.  This led to poor code generation on powerpc64 targets,
> among others.
>
> The poor code generation was observed not to occur in loops, as the
> IVOPTS code does a good job of lowering these expressions to MEM_REFs.
> It was previously suggested that perhaps a general pass to lower memory
> accesses to MEM_REFs in GIMPLE would solve not only this, but other
> similar problems.  I spent some time looking into various approaches to
> this, and reviewing some previous attempts to do similar things.  In
> the end, I've concluded that this is a bad idea in practice because of
> the loss of useful aliasing information.  In particular, early lowering
> of component references causes us to lose the ability to disambiguate
> non-overlapping references in the same structure, and there is no
> simple way to carry the necessary aliasing information along with the
> replacement MEM_REFs to avoid this.  While some performance gains are
> available with GIMPLE lowering of memory accesses, there are also
> offsetting performance losses, and I suspect this would just be a
> continuous source of bug reports into the future.
>
> Therefore the current patch is a much simpler approach to solve the
> specific problem noted in the PR.  There are two pieces to the patch:
>
> * The offending addressing pattern is matched in GIMPLE and
>   transformed into a restructured MEM_REF that distributes the
>   multiply, so that (n+32)*4 becomes 4*n+128 as before.  This is done
>   during the reassociation pass, for reasons described below.  The
>   transformation only occurs in non-loop blocks, since IVOPTS does a
>   good job on such things within loops.
>
> * A tweak is added to the RTL forward-propagator to avoid propagating
>   into memory references based on a single base register with no
>   offset, under certain circumstances.  This improves sharing of base
>   registers for accesses within the same structure and slightly
>   lowers register pressure.
>
> It would be possible to separate these into two patches if that's
> preferred.  I chose to combine them because together they provide the
> ideal code generation that the new test cases test for.
>
> I initially implemented the pattern matcher during expand, but I found
> that the expanded code for two accesses to the same structure was often
> not being CSEd well.  So I moved it back into the GIMPLE phases prior
> to DOM to take advantage of its CSE.  To avoid an additional complete
> pass over the IL, I chose to piggyback on the reassociation pass.  This
> transformation is not technically a reassociation, but it is related
> enough to not be a complete wart.
>
> One noob question about this: It would probably be preferable to have
> this transformation only take place during the second reassociation
> pass, so the ARRAY_REFs are seen by earlier optimization phases.  Is
> there an easy way to detect that it's the second pass without having
> to generate a separate pass entry point?
> One other general question about the pattern-match transformation: Is
> this an appropriate transformation for all targets, or should it be
> somehow gated on available addressing modes on the target processor?
>
> Bootstrapped and regression tested on powerpc64-linux-gnu.  Verified
> no performance degradations on that target for SPEC CPU2000 and
> CPU2006.
>
> I'm looking for eventual approval for trunk after any comments are
> resolved.  Thanks!
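To make the offset discussion in the quoted mail concrete, here is a
minimal C sketch of the address arithmetic for p->c[n] before and
after the proposed restructuring.  The function names and casts are
invented for illustration; they are not part of the patch, and the
literal 128 assumes a 4-byte int as on powerpc64:

  #include <stddef.h>

  struct x { int a[16]; int b[16]; int c[16]; };

  /* p->c[n] as folded after the PR32698 fix: the constant is pulled
     into the index, so each of the three accesses in f() needs its
     own multiply-add off of p.  */
  int *
  addr_folded (struct x *p, unsigned int n)
  {
    return (int *) ((char *) p + ((size_t) n + 32) * sizeof (int));
  }

  /* Restructured form: the multiply is distributed, so n * 4 becomes
     a common subexpression shared by p->a[n], p->b[n], and p->c[n],
     and the constant 128 can fold into the addressing mode.  */
  int *
  addr_restructured (struct x *p, unsigned int n)
  {
    return (int *) ((char *) p + (size_t) n * sizeof (int) + 128);
  }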
People have already commented on pieces, so I'm looking only at the
tree-ssa-reassoc.c pieces (did you consider piggy-backing on IVOPTs
instead?).  The idea is to expose additional CSE opportunities, right?
So it's sort of a strength-reduction optimization on scalar code
(classically, strength reduction in loops transforms
for (i) { z = i*x; } to z = 0; for (i) { z += x }).  That might be
worthwhile in general, even for non-address cases.  So - if you rename
that thing to tree-ssa-strength-reduce.c you can get away without
piggy-backing on anything ;)  If you structure it to detect a strength
reduction opportunity (thus, you'd need to match two/multiple of the
patterns at the same time) that would be a bonus ... generalizing it a
little bit would be another.
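Spelled out, the classic loop transformation alluded to above looks
roughly like this (a sketch with invented names, not code from the
patch):

  void
  before (int *z, int n, int x)
  {
    int i;
    for (i = 0; i < n; i++)
      z[i] = i * x;     /* one multiply per iteration */
  }

  void
  after (int *z, int n, int x)
  {
    int i, t = 0;
    for (i = 0; i < n; i++)
      {
        z[i] = t;       /* the multiply has been strength-reduced ...  */
        t += x;         /* ... to a running addition */
      }
  }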
Now some comments on the patch ...

> Bill
>
>
> 2011-10-05  Bill Schmidt  <wschm...@linux.vnet.ibm.com>
>
> gcc:
>
>         PR rtl-optimization/46556
>         * fwprop.c (fwprop_bb_aux_d): New struct.
>         (MEM_PLUS_REGS): New macro.
>         (record_mem_plus_reg): New function.
>         (record_mem_plus_regs): Likewise.
>         (single_def_use_enter_block): Record mem-plus-reg patterns.
>         (build_single_def_use_links): Allocate aux storage.
>         (locally_poor_mem_replacement): New function.
>         (forward_propagate_and_simplify): Call
>         locally_poor_mem_replacement.
>         (fwprop_init): Free storage.
>         * tree.h (copy_ref_info): Expose existing function.
>         * tree-ssa-loop-ivopts.c (copy_ref_info): Remove static token.
>         * tree-ssa-reassoc.c (restructure_base_and_offset): New function.
>         (restructure_mem_ref): Likewise.
>         (reassociate_bb): Look for opportunities to call
>         restructure_mem_ref; clean up immediate use lists.
>
> gcc/testsuite:
>
>         PR rtl-optimization/46556
>         * gcc.target/powerpc/ppc-pr46556-1.c: New testcase.
>         * gcc.target/powerpc/ppc-pr46556-2.c: Likewise.
>         * gcc.target/powerpc/ppc-pr46556-3.c: Likewise.
>         * gcc.target/powerpc/ppc-pr46556-4.c: Likewise.
>         * gcc.dg/tree-ssa/pr46556-1.c: Likewise.
>         * gcc.dg/tree-ssa/pr46556-2.c: Likewise.
>         * gcc.dg/tree-ssa/pr46556-3.c: Likewise.
>
> Index: gcc/tree-ssa-reassoc.c
> ===================================================================
> --- gcc/tree-ssa-reassoc.c      (revision 179317)
> +++ gcc/tree-ssa-reassoc.c      (working copy)
> @@ -2361,6 +2361,116 @@ break_up_subtract_bb (basic_block bb)
>      break_up_subtract_bb (son);
>  }
>
> +/* Given BASE and OFFSET derived from *EXPR occurring in stmt *GSI,
> +   determine whether there is a profitable opportunity to restructure
> +   address arithmetic within BASE and OFFSET.  If so, return the
> +   restructured MEM_REF; else return NULL_TREE.  */
> +
> +static tree
> +restructure_base_and_offset (tree *expr, gimple_stmt_iterator *gsi,
> +                             tree base, tree offset, HOST_WIDE_INT bitpos)
> +{
> +  tree mult_op0, mult_op1, t1, t2, offset_type, mult_expr, add_expr, mem_ref;
> +  HOST_WIDE_INT c, c1, c2, c3, c4;
> +
> +  /* Look for the following pattern:
> +
> +       base   = MEM_REF (T1, C1);
> +       offset = MULT_EXPR (PLUS_EXPR (T2, C2), C3)
> +       bitpos = C4 * BITS_PER_UNIT
> +
> +     If found, convert it to:
> +
> +       MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
> +                C1 + (C2 * C3) + C4)
> +  */
> +  if (!base
> +      || !offset
> +      || TREE_CODE (base) != MEM_REF
> +      || TREE_CODE (TREE_OPERAND (base, 1)) != INTEGER_CST
> +      || TREE_CODE (offset) != MULT_EXPR)
> +    return NULL_TREE;
> +
> +  mult_op0 = TREE_OPERAND (offset, 0);
> +  mult_op1 = TREE_OPERAND (offset, 1);
> +
> +  if (TREE_CODE (mult_op0) != PLUS_EXPR
> +      || TREE_CODE (mult_op1) != INTEGER_CST
> +      || TREE_CODE (TREE_OPERAND (mult_op0, 1)) != INTEGER_CST)
> +    return NULL_TREE;
> +
> +  t1 = TREE_OPERAND (base, 0);
> +  t2 = TREE_OPERAND (mult_op0, 0);
> +  c1 = TREE_INT_CST_LOW (TREE_OPERAND (base, 1));
> +  c2 = TREE_INT_CST_LOW (TREE_OPERAND (mult_op0, 1));
> +  c3 = TREE_INT_CST_LOW (mult_op1);

Before accessing TREE_INT_CST_LOW you need to make sure the constants
fit into a HWI using host_integerp () (which conveniently includes the
check for INTEGER_CST).

Note that you need to sign-extend the MEM_REF offset, thus use
mem_ref_offset (base).low instead of TREE_INT_CST_LOW (TREE_OPERAND
(base, 1)).  Might be worth adding a testcase with a negative
offset ;)

> +  c4 = bitpos / BITS_PER_UNIT;
> +  c = c1 + c2 * c3 + c4;

And you don't know whether this operation overflows.  Thus it's
probably easiest to use double_ints instead of HOST_WIDE_INTs in all
of the code.
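For illustration only, the constant folding done overflow-safely might
look roughly like this, assuming the function-style double_int API in
double-int.h of that era; treat the exact calls as an assumption, not
the committed fix:

  double_int c, c1, c2, c3, c4;

  c1 = mem_ref_offset (base);   /* sign-extends the MEM_REF offset */
  c2 = tree_to_double_int (TREE_OPERAND (mult_op0, 1));
  c3 = tree_to_double_int (mult_op1);
  c4 = shwi_to_double_int (bitpos / BITS_PER_UNIT);

  /* c = c1 + c2 * c3 + c4, computed without HOST_WIDE_INT overflow.  */
  c = double_int_add (double_int_add (c1, double_int_mul (c2, c3)), c4);

The final offset constant would then be built with double_int_to_tree
(offset_type, c) rather than build_int_cst.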
> +
> +  offset_type = TREE_TYPE (TREE_OPERAND (base, 1));
> +
> +  mult_expr = fold_build2 (MULT_EXPR, sizetype, t2,
> +                           build_int_cst (sizetype, c3));
> +  mult_expr = force_gimple_operand_gsi (gsi, mult_expr, true, NULL,
> +                                        true, GSI_SAME_STMT);
> +  add_expr = fold_build2 (POINTER_PLUS_EXPR, TREE_TYPE (t1), t1, mult_expr);
> +  add_expr = force_gimple_operand_gsi (gsi, add_expr, true, NULL,
> +                                       true, GSI_SAME_STMT);
> +  mem_ref = fold_build2 (MEM_REF, TREE_TYPE (*expr), add_expr,
> +                         build_int_cst (offset_type, c));
> +  return mem_ref;
> +}
> +
> +/* If *EXPR contains a memory reference, determine whether there is an
> +   opportunity to restructure the base and offset expressions for
> +   improved performance.  */
> +
> +static bool
> +restructure_mem_ref (tree *expr, gimple_stmt_iterator *gsi)
> +{
> +  enum tree_code code = TREE_CODE (*expr);
> +  tree base, offset, mem_ref;
> +  HOST_WIDE_INT bitsize, bitpos;
> +  enum machine_mode mode;
> +  int unsignedp, volatilep;
> +
> +  /* Only expressions that reference memory are of interest.  Bitfield
> +     references should eventually be lowered prior to this point and
> +     are not dealt with.  */
> +  if (! handled_component_p (*expr)
> +      || code == BIT_FIELD_REF
> +      || (code == COMPONENT_REF && DECL_BIT_FIELD (TREE_OPERAND (*expr, 1))))
> +    return false;

I'd say you want to handle references with a non-BLKmode only here,
thus

  || TYPE_MODE (TREE_TYPE (*expr)) == BLKmode

> +  /* Determine whether the expression can be represented with base and
> +     offset components.  */
> +  base = get_inner_reference (*expr, &bitsize, &bitpos, &offset, &mode,
> +                              &unsignedp, &volatilep, false);
> +  if (!base || !offset)
> +    return false;
> +
> +  /* Look for a restructuring opportunity.  */
> +  if ((mem_ref = restructure_base_and_offset (expr, gsi, base,
> +                                              offset, bitpos)) == NULL_TREE)
> +    return false;

What I'm missing is a check whether the old address computation stmts
will be dead after the transform.

> +  /* Found one.  Record the replacement in the dump file and complete
> +     the replacement.  */
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "\nOriginal_expr = ");
> +      print_generic_expr (dump_file, *expr, TDF_VOPS|TDF_MEMSYMS);
> +      fprintf (dump_file, "\nmem_ref = ");
> +      print_generic_expr (dump_file, mem_ref, TDF_VOPS|TDF_MEMSYMS);
> +      fprintf (dump_file, "\n\n");
> +    }
> +
> +  copy_ref_info (mem_ref, *expr);
> +  *expr = mem_ref;
> +
> +  return true;
> +}
> +
>  /* Reassociate expressions in basic block BB and its post-dominator as
>     children.  */
>
> @@ -2369,14 +2479,30 @@ reassociate_bb (basic_block bb)
>  {
>    gimple_stmt_iterator gsi;
>    basic_block son;
> +  bool chgd_mem_ref = false;
> +  bool in_loop = loop_depth (bb->loop_father) != 0;
>
>    for (gsi = gsi_last_bb (bb); !gsi_end_p (gsi); gsi_prev (&gsi))
>      {
>        gimple stmt = gsi_stmt (gsi);
>
> -      if (is_gimple_assign (stmt)
> -          && !stmt_could_throw_p (stmt))
> +      /* Look for restructuring opportunities within an expression
> +         that references memory.  We only do this for blocks not
> +         contained in loops, since the ivopts machinery does a
> +         good job on loop expressions, and we don't want to interfere
> +         with other loop optimizations.  */
> +      if (!in_loop && gimple_vuse (stmt) && gimple_assign_single_p (stmt))
>          {
> +          tree *lhs, *rhs;
> +          lhs = gimple_assign_lhs_ptr (stmt);
> +          chgd_mem_ref = restructure_mem_ref (lhs, &gsi) || chgd_mem_ref;
> +          rhs = gimple_assign_rhs1_ptr (stmt);
> +          chgd_mem_ref = restructure_mem_ref (rhs, &gsi) || chgd_mem_ref;

It will either be a store or a load, but never both (unless it's an
aggregate copy, which I think we should not handle).  So ...

  if (gimple_vdef (stmt))
    ... lhs
  else if (gimple_vuse (stmt))
    ... rhs

> +        }
> +
> +      else if (is_gimple_assign (stmt)
> +               && !stmt_could_throw_p (stmt))
> +        {
>            tree lhs, rhs1, rhs2;
>            enum tree_code rhs_code = gimple_assign_rhs_code (stmt);
>
> @@ -2489,6 +2615,12 @@ reassociate_bb (basic_block bb)
>              }
>          }
>      }
> +  /* If memory references have been restructured, immediate uses need
> +     to be cleaned up.  */
> +  if (chgd_mem_ref)
> +    for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +      update_stmt (gsi_stmt (gsi));

ICK.  Definitely a no ;)  Why does an update_stmt () after the
restructure_mem_ref call not work?

Richard.
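For what it's worth, combining the two review points above, the call
site in reassociate_bb might be reshaped along these lines (a sketch
against the patch as posted, untested):

  if (!in_loop && gimple_assign_single_p (stmt))
    {
      bool changed = false;

      /* A statement is either a store or a load, never both
         (aggregate copies are deliberately not handled).  */
      if (gimple_vdef (stmt))
        changed = restructure_mem_ref (gimple_assign_lhs_ptr (stmt), &gsi);
      else if (gimple_vuse (stmt))
        changed = restructure_mem_ref (gimple_assign_rhs1_ptr (stmt), &gsi);

      /* Update only the statement that changed, instead of walking
         the whole block again afterwards.  */
      if (changed)
        update_stmt (stmt);
    }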