Predictive commoning leads to register to register moves through memory.
Hello all,

I've been looking at a code generation issue with GCC 5.2 lately, dealing
with register to register moves through memory at -O3 -funroll-loops. For
reference, the C code is at the end of this mail. The generated code for
MIPS is (cut down for clarity; ldc1 and sdc1 are doubleword floating-point
loads and stores, respectively):

	div.d	$f8,$f6,$f4
	mul.d	$f2,$f8,$f8
	sdc1	$f2,8($7)
$L38:
	ldc1	$f0,8($7)	<- load instead of move
	li	$11,1		# 0x1
$L49:
	div.d	$f8,$f6,$f4
	addiu	$11,$10,3
	mul.d	$f2,$f8,$f8
	sdc1	$f2,8($7)
	ldc1	$f0,8($7)	<- load instead of move
$L48:
	mul.d	$f2,$f2,$f0
$L45:
	mul.d	$f2,$f4,$f4
	mov.d	$f8,$f4
	j	$L38
	sdc1	$f2,8($7)

For the basic block L38, all dominating blocks store to 8($7), which is then
loaded back into another floating-point register. Disabling predictive
commoning generates, for the same basic block:

	div.d	$f4,$f18,$f2
	mul.d	$f0,$f4,$f4
$L37:
	mul.d	$f6,$f0,$f0
	li	$10,1		# 0x1
	mul.d	$f8,$f0,$f6
	mul.d	$f10,$f0,$f8
	mul.d	$f12,$f0,$f10
	mul.d	$f14,$f0,$f12
	mul.d	$f16,$f0,$f14
	beq	$4,$10,$L38
	mul.d	$f20,$f0,$f16

Following Jeff's advice[1] to extract more information from GCC, I've
narrowed the cause down to the predictive commoning pass inserting the load
in a loop-header-style basic block. However, the next pass in GCC,
tree-cunroll, promptly removes the loop and joins the loop header to the
body of the (non)loop. More oddly, disabling the conditional store
elimination pass, or the dominator optimizations pass, or jump threading
via --param max-jump-thread-duplication-stmts=0, nets the above assembly
code. Any ideas on an approach for this issue?

[1] https://gcc.gnu.org/ml/gcc-help/2015-08/msg00162.html

Thanks,
Simon

double N;
int i1;
double T;
double poly[9];

void g (int iterations)
{
  int count = 0;
  for (count = 0; count < iterations; count++)
    {
      if (N > 1)
	{
	  T = 1 / N;
	}
      else
	{
	  T = N;
	}
      poly[1] = T * T;
      for (i1 = 2; i1 <= 8; i1++)
	{
	  poly[i1] = poly[i1 - 1] * poly[1];
	}
    }
  return;
}
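For concreteness, the better code above corresponds at the source level to
keeping every intermediate of the fully unrolled inner loop in a register.
The sketch below is illustrative only, not compiler output; the function
name is invented, and it reuses the globals from the test case:

/* Sketch: the mul.d chain above computes successive powers of poly[1]
   entirely in registers, roughly as if the source had been written
   like this.  */
void
g_unrolled_shape (void)
{
  double t = N > 1 ? 1 / N : N;
  double p1 = t * t;    /* poly[1] */
  double p2 = p1 * p1;  /* poly[2] = poly[1] * poly[1] */
  double p3 = p2 * p1;  /* poly[3] = poly[2] * poly[1] */
  double p4 = p3 * p1;
  double p5 = p4 * p1;
  double p6 = p5 * p1;
  double p7 = p6 * p1;
  double p8 = p7 * p1;
  poly[1] = p1; poly[2] = p2; poly[3] = p3; poly[4] = p4;
  poly[5] = p5; poly[6] = p6; poly[7] = p7; poly[8] = p8;
  T = t;
}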
RE: Predictive commoning leads to register to register moves through memory.
I've since taken another look at this and tracked the issue down to
tree-predcom.c, specifically ref_at_iteration almost always generating
MEM_REFs. With MEM_REFs, GCC's RTL GCSE pass cannot compare the references
as equal and hence cannot remove them. A previous version of the code did
generate ARRAY_REFs (pre 204458), but that was changed to generate MEM_REFs
for pr/58653. Would something like:

--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -1409,7 +1409,21 @@ ref_at_iteration (data_reference_p dr, int iter, gimple_seq *stmts)
 			     addr, alias_ptr),
 		     DECL_SIZE (field), bitsize_zero_node);
     }
-  else
+  /* Generate an ARRAY_REF for array references when all details are
+     INTEGER_CST rather than a MEM_REF so that CSE passes can potentially
+     optimize them.  */
+  else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF
+	   && TREE_CODE (DR_STEP (dr)) == INTEGER_CST
+	   && TREE_CODE (DR_INIT (dr)) == INTEGER_CST
+	   && TREE_CODE (DR_OFFSET (dr)) == INTEGER_CST)
+    {
+      /* Reverse engineer the element from the memory offset.  */
+      tree offset = size_binop (MINUS_EXPR, coff, off);
+      tree sizdiv = TYPE_SIZE (TREE_TYPE (TREE_TYPE (DR_BASE_OBJECT (dr))));
+      sizdiv = div_if_zero_remainder (EXACT_DIV_EXPR, sizdiv,
+				      ssize_int (BITS_PER_UNIT));
+      tree element = div_if_zero_remainder (EXACT_DIV_EXPR, offset, sizdiv);
+      if (element != NULL_TREE)
+	return build4 (ARRAY_REF, TREE_TYPE (DR_REF (dr)), DR_BASE_OBJECT (dr),
+		       element, NULL_TREE, NULL_TREE);
+    }
   return fold_build2 (MEM_REF, TREE_TYPE (DR_REF (dr)), addr, alias_ptr);

be an appropriate start to fixing this? That fix appears to work in my
testing.

Thanks,
Simon

-----Original Message-----
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: 31 August 2015 11:40
To: Jeff Law
Cc: Simon Dardis; gcc@gcc.gnu.org
Subject: Re: Predictive commoning leads to register to register moves through memory.

On Fri, Aug 28, 2015 at 5:48 PM, Jeff Law wrote:
> On 08/28/2015 09:43 AM, Simon Dardis wrote:
>
>> Following Jeff's advice[1] to extract more information from GCC, I've
>> narrowed the cause down to the predictive commoning pass inserting
>> the load in a loop header style basic block. However, the next pass
>> in GCC, tree-cunroll promptly removes the loop and joins the loop
>> header to the body of the (non)loop. More oddly, disabling
>> conditional store elimination pass or the dominator optimizations
>> pass or disabling of jump-threading with --param
>> max-jump-thread-duplication-stmts=0 nets the above assembly code. Any
>> ideas on an approach for this issue?
>
> I'd probably start by looking at the .optimized tree dump in both
> cases to understand the difference, then (most likely) tracing that
> through the RTL optimizers into the register allocator.

It's the known issue of LIM (here the one after pcom and complete unrolling
of the inner loop) being too aggressive with store-motion. Here the complete
array is replaced with registers for the outer loop. Were 'poly' a local
variable we'd have optimized it away completely.
  :
  _8 = 1.0e+0 / pretmp_42;
  _12 = _8 * _8;
  poly[1] = _12;

  :
  # prephitmp_30 = PHI <_12(6), _36(9)>
  # T_lsm.8_22 = PHI <_8(6), pretmp_42(9)>
  poly_I_lsm0.10_38 = MEM[(double *)&poly + 8B];
  _2 = prephitmp_30 * poly_I_lsm0.10_38;
  _54 = _2 * poly_I_lsm0.10_38;
  _67 = poly_I_lsm0.10_38 * _54;
  _80 = poly_I_lsm0.10_38 * _67;
  _93 = poly_I_lsm0.10_38 * _80;
  _106 = poly_I_lsm0.10_38 * _93;
  _19 = poly_I_lsm0.10_38 * _106;
  count_23 = count_28 + 1;
  if (count_23 != iterations_6(D))
    goto ;
  else
    goto ;

  :
  poly[2] = _2;
  poly[3] = _54;
  poly[4] = _67;
  poly[5] = _80;
  poly[6] = _93;
  poly[7] = _106;
  poly[8] = _19;
  i1 = 9;
  T = T_lsm.8_22;

Note that DOM fails to CSE poly[1] (a known defect), but heh, doing that
would only increase register pressure even more. Note the above is on
x86_64.

Richard.

> jeff
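The two tree forms at issue can be seen at the source level. The functions
below are a sketch for illustration (the function names are invented): both
read the same eight bytes of poly, but the first access is represented as an
ARRAY_REF while the second has the same shape as the
MEM[(double *)&poly + 8B] reference predictive commoning emits. Since
operand_equal_p does not treat the two forms as equal, RTL GCSE cannot prove
the reload redundant against the preceding store:

double poly[9];

/* ARRAY_REF form, as written in the original test case.  */
double
load_array_ref (void)
{
  return poly[1];
}

/* Byte-offset form; a source-level analogue of the
   MEM[(double *)&poly + 8B] reference created by predictive commoning.  */
double
load_mem_ref (void)
{
  return *(double *) ((char *) poly + sizeof (double));
}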
RE: Predictive commoning leads to register to register moves through memory.
I took an attempt at addressing this through the RTL GCSE pass. This attempt
tweaks mem_attrs_eq_p to return true if it is comparing something like
poly+8 and MEM[&poly + 8]. Is this a more suitable approach?

Thanks,
Simon

+/* Return true if p and q reference the same location by the same name but
+   through VAR_DECL and MEM_REF.  */
+
+static bool
+mem_locations_match_p (const struct mem_attrs *p, const struct mem_attrs *q)
+{
+  HOST_WIDE_INT var_offset;
+  tree var, memref;
+
+  if (p->expr == NULL_TREE || q->expr == NULL_TREE)
+    return false;
+
+  if (TREE_CODE (p->expr) == MEM_REF && TREE_CODE (q->expr) == VAR_DECL)
+    {
+      var = q->expr;
+      var_offset = q->offset;
+      memref = p->expr;
+    }
+  else if (TREE_CODE (q->expr) == MEM_REF && TREE_CODE (p->expr) == VAR_DECL)
+    {
+      var = p->expr;
+      var_offset = p->offset;
+      memref = q->expr;
+    }
+  else
+    return false;
+
+  if (TREE_OPERAND (TREE_OPERAND (memref, 0), 0) != var)
+    return false;
+
+  if (TREE_TYPE (TREE_TYPE (var)) != TREE_TYPE (memref))
+    return false;
+
+  tree offset = TREE_OPERAND (memref, 1);
+  if ((TREE_CODE (offset) == INTEGER_CST && tree_fits_shwi_p (offset)
+       && tree_to_shwi (offset) == var_offset)
+      || (offset == NULL_TREE && var_offset == 0))
+    return true;
+
+  return false;
+}
+
 /* Return true if the given memory attributes are equal.  */

 bool
@@ -254,16 +298,16 @@ mem_attrs_eq_p (const struct mem_attrs *p, const struct mem_attrs *q)
     return false;
   return (p->alias == q->alias
 	  && p->offset_known_p == q->offset_known_p
-	  && (!p->offset_known_p || p->offset == q->offset)
 	  && p->size_known_p == q->size_known_p
 	  && (!p->size_known_p || p->size == q->size)
 	  && p->align == q->align
 	  && p->addrspace == q->addrspace
-	  && (p->expr == q->expr
-	      || (p->expr != NULL_TREE && q->expr != NULL_TREE
-		  && operand_equal_p (p->expr, q->expr, 0))));
+	  && (mem_locations_match_p (p, q)
+	      || ((!p->offset_known_p || p->offset == q->offset)
+		  && (p->expr == q->expr
+		      || (p->expr != NULL_TREE && q->expr != NULL_TREE
+			  && operand_equal_p (p->expr, q->expr, 0))))));
 }
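To make the intended match concrete, here is a worked example of the case
the predicate catches, written as a commented trace. The concrete values are
inferred from the poly[1] reload earlier in the thread, not taken from a
debugger session:

/* Worked example for mem_locations_match_p (illustrative; the types are
   GCC's internal trees, so this block is documentation rather than
   standalone code).  For the reload of poly[1], the two MEMs carry:

     p->expr   = MEM_REF (ADDR_EXPR (VAR_DECL poly), INTEGER_CST 8)
                   i.e. MEM[(double *)&poly + 8B]
     q->expr   = VAR_DECL poly
     q->offset = 8
                   i.e. poly+8

   The predicate then checks, in order:

     TREE_OPERAND (TREE_OPERAND (memref, 0), 0) == var
       -- the MEM_REF's address strips back to the same VAR_DECL;
     TREE_TYPE (TREE_TYPE (var)) == TREE_TYPE (memref)
       -- double, the element type of poly, matches the access type;
     tree_to_shwi (TREE_OPERAND (memref, 1)) == var_offset
       -- both byte offsets are 8;

   so the two attribute sets name the same location and it returns true.  */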
RE: Predictive commoning leads to register to register moves through memory.
It seems to me that extending operand_equal_p to deal with the
MEM_REF/ARRAY_REF case could help. Your recent change of:

      if (TREE_CODE (arg0) == MEM_REF && DECL_P (arg1)
	  && TREE_CODE (TREE_OPERAND (arg0, 0)) == ADDR_EXPR
	  && TREE_OPERAND (TREE_OPERAND (arg0, 0), 0) == arg1
	  && integer_zerop (TREE_OPERAND (arg0, 1)))
	return 1;
      else if (TREE_CODE (arg1) == MEM_REF && DECL_P (arg0)
	       && TREE_CODE (TREE_OPERAND (arg1, 0)) == ADDR_EXPR
	       && TREE_OPERAND (TREE_OPERAND (arg1, 0), 0) == arg0
	       && integer_zerop (TREE_OPERAND (arg1, 1)))
	return 1;
      return 0;

only seems to be handling the case of 'X' and MEM_REF[&X + 0]. Do you think
changing this to 'return addr_eq_p (arg0, arg1);', where addr_eq_p also
handles the cases of ARRAY_(RANGE_)REF and COMPONENT_REF by checking the
offset as well, would be useful?

I've put together an initial version which splits out the change to
operand_equal_p into its own predicate for just identifying cases of base
objects, plus the above-mentioned addr_eq_p. Also in that patch is the
change for mem_attrs_eq_p, since in that case the offset is not part of the
TREE expression and so has to be handled differently. Thoughts?

Thanks,
Simon

-----Original Message-----
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: 22 September 2015 12:12
To: Simon Dardis
Cc: Jeff Law; gcc@gcc.gnu.org
Subject: Re: Predictive commoning leads to register to register moves through memory.

On Tue, Sep 22, 2015 at 12:45 PM, Simon Dardis wrote:
> I took an attempt at addressing this through the RTL GCSE pass. This
> attempt tweaks mem_attrs_eq_p to return true if it is comparing
> something like poly+8 and MEM[&poly + 8].
>
> Is this a more suitable approach?

I actually recently modified stmt_kills_ref_p for a similar reason... not
that I liked that very much.

I think the more suitable approach would be to not have both forms if they
are actually equal. Of course that's way more work. So splitting out a
function that handles semantic equality compare of MEM_EXPR sounds good to
me. No comments on your actual implementation yet, but I think we should do
it in a way to be re-usable by stmt_kills_ref_p.

Richard.

> Thanks,
> Simon
>
> +/* Return true if p and q reference the same location by the same name but
> +   through VAR_DECL and MEM_REF.  */
> +
> +static bool
> +mem_locations_match_p (const struct mem_attrs *p, const struct mem_attrs *q)
> +{
> +  HOST_WIDE_INT var_offset;
> +  tree var, memref;
> +
> +  if (p->expr == NULL_TREE || q->expr == NULL_TREE)
> +    return false;
> +
> +  if (TREE_CODE (p->expr) == MEM_REF && TREE_CODE (q->expr) == VAR_DECL)
> +    {
> +      var = q->expr;
> +      var_offset = q->offset;
> +      memref = p->expr;
> +    }
> +  else if (TREE_CODE (q->expr) == MEM_REF && TREE_CODE (p->expr) == VAR_DECL)
> +    {
> +      var = p->expr;
> +      var_offset = p->offset;
> +      memref = q->expr;
> +    }
> +  else
> +    return false;
> +
> +  if (TREE_OPERAND (TREE_OPERAND (memref, 0), 0) != var)
> +    return false;
> +
> +  if (TREE_TYPE (TREE_TYPE (var)) != TREE_TYPE (memref))
> +    return false;
> +
> +  tree offset = TREE_OPERAND (memref, 1);
> +  if ((TREE_CODE (offset) == INTEGER_CST && tree_fits_shwi_p (offset)
> +       && tree_to_shwi (offset) == var_offset)
> +      || (offset == NULL_TREE && var_offset == 0))
> +    return true;
> +
> +  return false;
> +}
> +
>  /* Return true if the given memory attributes are equal.  */
>
>  bool
> @@ -254,16 +298,16 @@ mem_attrs_eq_p (const struct mem_attrs *p, const struct mem_attrs *q)
>      return false;
>    return (p->alias == q->alias
> 	  && p->offset_known_p == q->offset_known_p
> -	  && (!p->offset_known_p || p->offset == q->offset)
> 	  && p->size_known_p == q->size_known_p
> 	  && (!p->size_known_p || p->size == q->size)
> 	  && p->align == q->align
> 	  && p->addrspace == q->addrspace
> -	  && (p->expr == q->expr
> -	      || (p->expr != NULL_TREE && q->expr != NULL_TREE
> -		  && operand_equal_p (p->expr, q->expr, 0))));
> +	  && (mem_locations_match_p (p, q)
> +	      || ((!p->offset_known_p || p->offset == q->offset)
> +		  && (p->expr == q->expr
> +		      || (p->expr != NULL_TREE && q->expr != NULL_TREE
> +			  && operand_equal_p (p->expr, q->expr, 0))))));
>  }
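As a rough illustration of the direction being discussed, below is a minimal
sketch of the ARRAY_REF arm such an addr_eq_p helper might have. The
function name, the restriction to constant indices, the zero-lower-bound
assumption, and the use of TYPE_SIZE_UNIT are all assumptions made for
illustration; this is not the patch that was attached:

/* Hypothetical sketch against GCC's internal tree API -- not the attached
   patch.  Return true if MEMREF is MEM_REF[&BASE + OFF] and ARR is
   ARRAY_REF(BASE, IDX) with OFF == IDX * element-size.  Assumes the array
   has a zero lower bound, as C arrays do.  */

static bool
addr_eq_array_ref_p (tree memref, tree arr)
{
  if (TREE_CODE (memref) != MEM_REF || TREE_CODE (arr) != ARRAY_REF)
    return false;

  /* The MEM_REF must take the address of the very decl that the
     ARRAY_REF indexes into.  */
  tree addr = TREE_OPERAND (memref, 0);
  if (TREE_CODE (addr) != ADDR_EXPR
      || TREE_OPERAND (addr, 0) != TREE_OPERAND (arr, 0))
    return false;

  /* Only handle constant offsets, constant indices and constant-size
     elements that fit in a HOST_WIDE_INT.  */
  tree off = TREE_OPERAND (memref, 1);
  tree idx = TREE_OPERAND (arr, 1);
  tree elt_size = TYPE_SIZE_UNIT (TREE_TYPE (arr));
  if (!tree_fits_shwi_p (off)
      || !tree_fits_shwi_p (idx)
      || elt_size == NULL_TREE
      || !tree_fits_shwi_p (elt_size))
    return false;

  /* poly[1] vs. MEM[&poly + 8B]: 8 == 1 * sizeof (double).  */
  return tree_to_shwi (off) == tree_to_shwi (idx) * tree_to_shwi (elt_size);
}

A COMPONENT_REF arm would presumably compare the MEM_REF offset against the
field's byte position instead, along the lines Simon suggests above.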